Method and/or apparatus of oligonucleotide design and/or nucleic acid detection

ABSTRACT

A method of designing at least one oligonucleotide for nucleic acid detection including: (I) identifying and/or selecting region(s) of at least one target nucleic acid to be amplified, the region(s) having an efficiency of amplification (AE) higher than the average AE; and (II) designing at least one oligonucleotide capable of hybridizing to the selected region(s). Also, a method of detecting at least one target nucleic acid including: (i) providing at least one biological sample; (ii) amplifying nucleic acid(s) in the biological sample; (iii) providing at least one oligonucleotide capable of hybridizing to at least one target nucleic acid, if present in the biological sample; and (iv) contacting the oligonucleotide(s) with the amplified nucleic acids and detecting the oligonucleotide(s) hybridized to the target nucleic acid(s). The method is useful for detecting the presence of at least one pathogen, such as a virus, in a human biological sample.

This is a continuation of application Ser. No. 11/990,290, filed on Feb. 11, 2008, which is a 371 of International Application No. PCT/SG2006/000224, filed on Aug. 8, 2006 and a continuation-in-part of prior application Ser. No. 11/202,023, filed on Aug. 12, 2005, now abandoned.

FIELD OF THE INVENTION

The present invention relates to the field of oligonucleotide design and/or nucleic acid detection. The method, apparatus and/or product according to the invention may be used for the detection of pathogens, for example for the detection of viruses.

BACKGROUND OF THE INVENTION

The accurate and rapid detection of viral and bacterial pathogens in human patients and populations is of critical medical and epidemiologic importance. Historically, diagnostic techniques have relied on cell culture passaging and various immunological assays or staining procedures. Accurate and sensitive detection of infectious disease agents is still difficult today, despite a long history of progress in this area. Traditional methods of culture and antibody-based detection still play a central role in microbiological laboratories despite the problems of the delay between disease presentation and diagnosis, and the limited number of organisms that can be detected by these approaches. Faster diagnosis of infections would reduce morbidity and mortality, for example, through the earlier implementation of appropriate antimicrobial treatment. During the past few decades, various methods have been proposed to achieve this, with those based on nucleic acid detection, including PCR and microarray-based techniques, seeming the most promising. In particular, PCR-based assays have been implemented, allowing for more rapid diagnosis of suspected pathogens with a higher degree of sensitivity of detection. In clinical practice, however, the etiologic agent often remains unidentified, eluding detection in myriad ways. For example, some viruses are not amenable to culturing. At other times, a patient's sample may be of too poor quality or of insufficient titre for pathogen detection by conventional techniques. Moreover, both PCR- and antibody-based approaches may fail to recognize suspected pathogens simply due to natural genetic diversification resulting in alterations of PCR primer binding sites and antigenic drift.

DNA and oligonucleotide microarrays with the potential to detect multiple pathogens in parallel have been described (Wang et al. 2002; Urisman et al. 2005). However, unresolved technical questions prevent their routine use in the clinical setting. For example, how does one select the most informative probes to comprise a pathogen “signature” in light of amplification and cross-hybridization artifacts? What levels of fluorescent signal and signature probe involvement constitute a detected pathogen? What is the accuracy and sensitivity of an optimized detection algorithm? (Striebel et al. 2003; Bodrossy and Sessitsch, 2004; Vora et al. 2004).

Accordingly, there is a need in this field of technology for alternative and improved methods of detection of nucleic acids. In particular, there is a need for alternative and/or improved diagnostic methods for the detection of pathogens.

SUMMARY OF THE INVENTION

The present invention addresses the problems above, and in particular provides a method, apparatus and/or product of oligonucleotide design. In particular, there is provided a method, apparatus and/or product of oligonucleotide probe and/or primer design. There is also provided a method, apparatus and/or product of nucleic acid detection.

According to a first aspect, the present invention provides a method of designing at least one oligonucleotide for nucleic acid detection comprising the following steps in any order:

-   (I) identifying and/or selecting at least one region of at least one target nucleic acid to be amplified, the region(s) having an efficiency of amplification (AE) higher than the average AE; and
-   (II) designing at least one oligonucleotide capable of hybridizing to the selected region(s).

The at least one oligonucleotide may be at least one probe and/or primer.

In particular, in step (I) a score of AE is determined for every position i on the length of the target nucleic acid(s) or of at least one region thereof and subsequently, an average AE score is obtained. Those regions showing an AE score higher than the average may be selected as the region(s) of the target nucleic acid to be amplified. In particular, the AE of the selected region(s) may be calculated as the Amplification Efficiency Score (AES), which is the probability that a forward primer r_(i) can bind to a position i and a reverse primer r_(j) can bind at a position j of the target nucleic acid, and |i−j| is the region of the target nucleic acid desired to be amplified. In particular, the region |i−j| may be ≦10000 bp, more in particular ≦5000 bp, or ≦1000 bp, for example ≦500 bp. In particular, the forward and reverse primers may be random primers.
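Purely by way of illustration, the selection of regions scoring above the average may be sketched as follows in Python. The function name `select_regions` and the example scores are hypothetical and are not part of the specification; the sketch assumes that a per-position score has already been computed and that a region is a maximal run of consecutive positions scoring strictly above the mean.

```python
from statistics import mean


def select_regions(aes_scores):
    """Return (start, end) pairs, end exclusive, for each maximal run of
    positions whose score is strictly above the mean of all scores."""
    average = mean(aes_scores)
    regions = []
    start = None
    for i, score in enumerate(aes_scores):
        if score > average and start is None:
            start = i                      # a high-scoring run begins here
        elif score <= average and start is not None:
            regions.append((start, i))     # the run ended at position i - 1
            start = None
    if start is not None:                  # run extends to the last position
        regions.append((start, len(aes_scores)))
    return regions


# Hypothetical per-position scores, for illustration only.
example_scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7, 0.9]
print(select_regions(example_scores))
```

Treating a region as a contiguous run is one possible reading; a windowed average over a fixed region length would be an equally plausible alternative.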

According to another aspect, the step (I) comprises determining the effect of geometrical amplification bias for every position of a target nucleic acid, and selecting at least one region(s) to be amplified as the region(s) having an efficiency of amplification (AE) higher than the average AE. For example, the geometrical amplification bias is the PCR bias.

The step (II) of designing at least one oligonucleotide capable of hybridizing to the region(s) selected in step (I) may be carried out according to any oligonucleotide designing technique known in the art. In particular, the oligonucleotide(s) capable of hybridizing to the selected region(s) may be selected and designed according to at least one of the following criteria:

-   (a) the selected oligonucleotide(s) has a CG-content from 40% to 60%;
-   (b) the oligonucleotide(s) is selected by having the highest free energy computed based on the Nearest-Neighbor model;
-   (c) given oligonucleotide s_(a) and oligonucleotide s_(b) substrings of target nucleic acids v_(a) and v_(b), s_(a) is selected based on the hamming distance between s_(a) and any length-m substring s_(b) and/or on the longest common substring of s_(a) and oligonucleotide s_(b);
-   (d) for any oligonucleotide s_(a) of length-m specific for the target nucleic acid v_(a), the oligonucleotide s_(a) is selected if it does not have any hits with any region of a nucleic acid different from the target nucleic acid, and if the oligonucleotide s_(a) length-m has hits with the nucleic acid different from the target nucleic acid, the oligonucleotide s_(a) length-m with the smallest maximum alignment length and/or with the least number of hits is selected; and
-   (e) an oligonucleotide p_(i) at position i of a target nucleic acid is selected if p_(i) is predicted to hybridize to the position i of the amplified target nucleic acid.
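As a minimal sketch of how criteria (a) and (c) might be evaluated, the following Python helpers compute the CG-content of an oligonucleotide, its smallest hamming distance to any same-length substring of another sequence, and the length of the longest common substring. All function names are hypothetical, and the sketch makes no claim as to the thresholds or the implementation actually used.

```python
def gc_content(oligo):
    """Fraction of G and C bases in the oligonucleotide (criterion (a))."""
    oligo = oligo.upper()
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_to_sequence(oligo, other):
    """Smallest hamming distance between the oligo and any length-m
    substring of another (non-target) sequence (criterion (c))."""
    m = len(oligo)
    return min(hamming_distance(oligo, other[i:i + m])
               for i in range(len(other) - m + 1))


def longest_common_substring(a, b):
    """Length of the longest contiguous match, by dynamic programming."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1   # extend the match ending at j - 1
                best = max(best, cur[j])
        prev = cur
    return best


# A candidate passes criterion (a) if its CG-content lies in [0.40, 0.60].
candidate = "GGCCAATT"
print(0.40 <= gc_content(candidate) <= 0.60)
```

A candidate with a large minimum hamming distance and a short longest common substring against non-target sequences would be less likely to cross-hybridize; the cut-off values are left to the designer.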

In particular, the oligonucleotide may be a probe and/or primer.

Accordingly, two or more of the criteria indicated above may be used for designing the oligonucleotide(s). For example, the oligonucleotide(s) may be designed by applying all criteria (a) to (e). Other criteria not explicitly mentioned herein but which are within the knowledge of a skilled person in the art may also be used.

In particular, under the criterion (e), an oligonucleotide p_(i) at position i of a target nucleic acid v_(a) is selected if P(p_(i)|v_(a))>λ, wherein λ is 0.5 and P(p_(i)|v_(a)) is the probability that p_(i) hybridizes to the position i of the target nucleic acid v_(a). More in particular, λ is 0.8.

In particular,

${{{P\left( p_{i} \middle| v_{a} \right)} \approx {P\left( {X \leqq x_{i}} \right)}} = \frac{c_{i}}{k}},$

wherein X is the random variable representing the amplification efficiency score (AES) values of all oligonucleotides of v_(a), k is the number of oligonucleotides in v_(a), and c_(i) is the number of oligonucleotides whose AES values are ≦x_(i).
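The ratio c_(i)/k is an empirical cumulative distribution evaluated at each oligonucleotide's own AES value. A minimal Python sketch, assuming the AES values of all k oligonucleotides of v_(a) are already available as a list, is given below; the function names and example values are hypothetical.

```python
from bisect import bisect_right


def hybridization_probabilities(aes_values):
    """For each oligonucleotide i, return c_i / k, where c_i is the number
    of oligonucleotides whose AES value is <= x_i and k is the total."""
    k = len(aes_values)
    ordered = sorted(aes_values)
    # bisect_right counts how many sorted values are <= x.
    return [bisect_right(ordered, x) / k for x in aes_values]


def select_by_threshold(aes_values, lam=0.5):
    """Indices of oligonucleotides with P(p_i | v_a) > lambda."""
    probabilities = hybridization_probabilities(aes_values)
    return [i for i, p in enumerate(probabilities) if p > lam]


# Hypothetical AES values for four oligonucleotides.
aes = [0.2, 0.9, 0.5, 0.7]
print(hybridization_probabilities(aes))
print(select_by_threshold(aes, lam=0.5))
print(select_by_threshold(aes, lam=0.8))
```

Sorting once and using a binary search keeps the computation at O(k log k) rather than comparing every pair of values.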

According to another aspect of the invention, the method of designing the oligonucleotide(s) as described above further comprises a step of preparing the selected and designed oligonucleotide(s). The oligonucleotide, which may be at least one probe and/or primer, may be prepared according to any standard method known in the art, for example by chemical synthesis or photolithography.

According to another aspect, the present invention provides a method of detecting at least one target nucleic acid comprising the steps of:

-   (i) providing at least one biological sample;
-   (ii) amplifying nucleic acid(s) comprised in the biological sample;
-   (iii) providing at least one oligonucleotide capable of hybridising to at least one target nucleic acid, if present in the biological sample, wherein the oligonucleotide(s) is designed and/or prepared by using a method according to any aspect of the invention described herein; and
-   (iv) contacting the oligonucleotide(s) with the amplified nucleic acids and/or detecting the oligonucleotide(s) hybridised to the target nucleic acid(s).

In particular, the oligonucleotide is a probe.

The amplification step (ii) may be carried out in the presence of random primers. For example, the amplification step (ii) may be carried out in the presence of at least one random forward primer, at least one random reverse primer and/or more than two random primers. Any amplification method known in the art may be used. For example, the amplification method is RT-PCR.

In particular, a forward random primer binding to position i and a reverse random primer binding to position j of a target nucleic acid v_(a) are selected among primers having an amplification efficiency score (AES_(i)) for every position i of a target nucleic acid v_(a) of:

$\mathrm{AES}_{i} = \sum\limits_{j = i - Z}^{i} \left\{ P^{f}(j) \cdot \sum\limits_{k = \max(i + 1,\, j + 500)}^{j + Z} P^{r}(k) \right\}, \quad \text{wherein} \quad \sum\limits_{k = \max(i + 1,\, j + 500)}^{j + Z} P^{r}(k) = P^{r}(i + 1) + P^{r}(i + 2) + \ldots + P^{r}(j + Z),$

P^(f)(i) and P^(r)(i) are the probability that a random primer r_(i) can bind to position i of v_(a) as forward primer and reverse primer, respectively, and Z≦10000 bp is the region of v_(a) desired to be amplified. More in particular, Z may be ≦5000 bp, ≦1000 bp, or ≦500 bp.
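For illustration only, the AES summation may be evaluated directly as in the Python sketch below. The function name `amplification_efficiency_scores`, the 0-based indexing, the clipping of indices to the ends of the sequence, and the parameter `min_len` (standing for the constant 500 in the formula) are assumptions made for this sketch; the binding probabilities P^(f) and P^(r) are taken as given inputs.

```python
from itertools import accumulate


def amplification_efficiency_scores(pf, pr, z=10000, min_len=500):
    """Evaluate AES_i for every position i.

    pf[j] and pr[k] are the forward and reverse binding probabilities.
    For each i, forward sites j in [i - z, i] are paired with reverse
    sites k in [max(i + 1, j + min_len), j + z], clipped to the sequence.
    """
    n = len(pf)
    # prefix[t] holds sum(pr[:t]) so that any range sum is O(1).
    prefix = [0.0] + list(accumulate(pr))
    scores = []
    for i in range(n):
        total = 0.0
        for j in range(max(0, i - z), i + 1):
            low = max(i + 1, j + min_len)
            high = min(j + z, n - 1)
            if low <= high:
                total += pf[j] * (prefix[high + 1] - prefix[low])
        scores.append(total)
    return scores


# Toy input: a single forward site at 0 and a single reverse site at 3.
print(amplification_efficiency_scores([1, 0, 0, 0], [0, 0, 0, 1],
                                      z=3, min_len=2))
```

The direct double loop costs O(n·Z) operations; for genome-length inputs a sliding-window formulation would be preferable, but the simple form mirrors the formula term by term.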

The amplification step may comprise forward and reverse primers, and each of the forward and reverse primers may comprise, in a 5′-3′ orientation, a fixed primer header and a variable primer tail, and wherein at least the variable tail hybridizes to a portion of the target nucleic acid v_(a). In particular, the amplification step may comprise forward and/or reverse random primers having the nucleotide sequence of SEQ ID NO:1 or a variant or derivative thereof.

The biological sample may be any sample taken from a mammal, for example from a human being. The biological sample may be tissue, sera, nasal pharyngeal washes, saliva, any other body fluid, blood, urine, stool, and the like. The biological sample may be treated to free the nucleic acid comprised in the biological sample before carrying out the amplification step. The target nucleic acid may be any nucleic acid which is intended to be detected. The target nucleic acid to be detected may be at least a nucleic acid exogenous to the nucleic acid of the biological sample. Accordingly, if the biological sample is from a human, the exogenous target nucleic acid to be detected (if present in the biological sample) is a nucleic acid which is not from human origin. According to an aspect of the invention, the target nucleic acid to be detected is at least a pathogen genome or fragment thereof. The pathogen nucleic acid may be at least a nucleic acid from a virus, a parasite, or bacterium, or a fragment thereof.

Accordingly, the invention provides a method of detection of at least a target nucleic acid, if present, in a biological sample. The method may be a diagnostic method for the detection of the presence of a pathogen in the biological sample. For example, if the biological sample is obtained from a human being, the target nucleic acid, if present in the biological sample, is not from human.

The oligonucleotide(s) designed and/or prepared according to any method of the present invention may be used in solution or may be placed on an insoluble support. For example, the oligonucleotide probe(s) may be applied, spotted or printed on an insoluble support according to any technique known in the art. The support may be a microarray, a biochip, a membrane/synthetic surface, solid support or a gel.

The probes are then contacted with the nucleic acid(s) of the biological sample, and, if present, the target nucleic acid(s) and the probe(s) hybridize, and the presence of the target nucleic acid is detected. In particular, in the detection step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.

More in particular, in the detection step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.

For example, in the detection step (iv), at least one target nucleic acid in a biological sample is detected if the density distribution of its probe signal intensities is not normal, i.e. more positively skewed, given by Anderson-Darling test value ≦0.05 and/or a value of t-tests ≦0.1 and/or a value of Weighted Kullback-Leibler divergence of ≧1.0, preferably ≧5.0. In particular, the t-test value is ≦0.05.
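As a hedged illustration of the two quantities compared in the detection step, the Python sketch below computes a Welch-type t statistic for the difference in mean signal intensity and the sample skewness of a set of intensities. The function names are hypothetical, the choice of Welch's unequal-variance form is an assumption (the specification does not state which t-test variant is used), and no p-value is computed here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(target_signals, other_signals):
    """t statistic for mean(target) - mean(other), unequal variances."""
    standard_error = sqrt(variance(target_signals) / len(target_signals)
                          + variance(other_signals) / len(other_signals))
    return (mean(target_signals) - mean(other_signals)) / standard_error


def skewness(signals):
    """Moment-based sample skewness; positive for a long right tail."""
    centre = mean(signals)
    n = len(signals)
    second = sum((x - centre) ** 2 for x in signals) / n
    third = sum((x - centre) ** 3 for x in signals) / n
    return third / second ** 1.5


# Hypothetical intensities: target probes versus all other probes.
print(welch_t_statistic([5, 6, 7], [1, 2, 3]))
print(skewness([1, 1, 1, 10]))
```

Converting the t statistic into a p-value requires the t distribution with the appropriate degrees of freedom, which would normally be taken from a statistics library.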

More in particular, the method of the detection step (iv) further comprises evaluating the probe signal intensity of probe(s) in each pathogen specific signature probe set (SPS) for the target nucleic acid(s) v_(a) by calculating the distribution of Weighted Kullback-Leibler (WKL) divergence scores:

${{WKL}\left( P_{a} \middle| \overset{\_}{P_{a}} \right)} = {\sum\limits_{j = 0}^{k - 1}\; \frac{{Q_{a}(j)}{\log \left( \frac{Q_{a}(j)}{Q_{\overset{\_}{a}}(j)} \right)}}{\sqrt{{Q_{\overset{\_}{a}}(j)}\left\lbrack {1 - {Q_{\overset{\_}{a}}(j)}} \right\rbrack}}}$

where Q_(a)(j) is the cumulative distribution function of the signal intensities of the probes in P_(a) found in bin b_(j); Q_(ā)(j) is the cumulative distribution function of the signal intensities of the probes in P_(ā) found in bin b_(j). P_(a) is the set of probes of a virus v_(a) and P_(ā)=P−P_(a).
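A minimal sketch of the WKL summation, written in Python, is given below. It assumes that the per-bin cumulative values Q_(a)(j) and Q_(ā)(j) have already been computed over the same k bins; the binning itself is not specified here and is left to the caller. The function name and the rule of skipping bins in which a term is undefined (a zero value, or Q_(ā)(j) equal to 1) are assumptions of this sketch.

```python
from math import log, sqrt


def weighted_kl_divergence(q_target, q_other):
    """Sum over bins of Q_a * log(Q_a / Q_abar) / sqrt(Q_abar * (1 - Q_abar)).

    q_target[j] and q_other[j] are the cumulative fractions for bin j of
    the target probe set and of all remaining probes, respectively.
    """
    if len(q_target) != len(q_other):
        raise ValueError("both inputs must cover the same bins")
    score = 0.0
    for qa, qb in zip(q_target, q_other):
        # Assumption: skip bins where the logarithm or the weight is
        # undefined instead of adding a pseudo-count.
        if qa <= 0 or qb <= 0 or qb >= 1:
            continue
        score += qa * log(qa / qb) / sqrt(qb * (1 - qb))
    return score


# Identical distributions give a score of zero.
print(weighted_kl_divergence([0.2, 0.5, 1.0], [0.2, 0.5, 1.0]))
```

The denominator is the standard deviation of a Bernoulli variable with parameter Q_(ā)(j), so bins near the tails of the background distribution receive more weight than bins near its centre.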

For example, each signature probe set (SPS) which represents the absence of target nucleic acid(s) v_(a) has a normally distributed signal intensity (assessed by Anderson-Darling test value ≦0.05) and/or a Weighted Kullback-Leibler (WKL) divergence score of WKL<5. Each signature probe set (SPS) which represents the presence of at least one target nucleic acid v_(a) has a positively skewed signal intensity distribution and/or a Weighted Kullback-Leibler (WKL) divergence score of WKL>5.
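To make the normality assessment concrete, the following Python sketch computes the Anderson-Darling A² statistic against a normal distribution whose mean and standard deviation are estimated from the sample. The function name is hypothetical, and the sketch returns only the statistic; the conversion to a p-value and the particular small-sample correction, if any, are not specified here and are omitted.

```python
from math import log
from statistics import NormalDist, mean, stdev


def anderson_darling_statistic(sample):
    """A^2 statistic for normality with estimated mean and deviation.

    Larger values indicate a larger departure from a normal distribution.
    """
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    cdf = [fitted.cdf(x) for x in sorted(sample)]
    # 0-based form of sum (2i - 1) [ln F(y_i) + ln(1 - F(y_{n+1-i}))].
    weighted_sum = sum((2 * i + 1) * (log(cdf[i]) + log(1 - cdf[n - 1 - i]))
                       for i in range(n))
    return -n - weighted_sum / n


# A near-normal sample, and the same sample with one extreme value.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
with_outlier = normal_like[:-1] + [50.0]
print(anderson_darling_statistic(normal_like))
print(anderson_darling_statistic(with_outlier))
```

A single extreme value inflates the statistic considerably, which is the behaviour relied upon when one signature probe set scores far above the rest.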

The method may further comprise performing Anderson-Darling test on the distribution of WKL score(s), wherein a result of P>0.05 thereby indicates the absence of target nucleic acid(s) v_(a), or wherein a result of P<0.05 thereby indicates the presence of target nucleic acid(s) v_(a). Additionally, a further Anderson-Darling test may be performed thereby indicating the presence of further co-infecting target nucleic acid(s).

According to another aspect, the present invention provides a method of determining the presence of a target nucleic acid v_(a) comprising detecting the hybridization of at least one oligonucleotide probe (the probe being selected and designed according to any known method in the art and not necessarily limited to the methods according to the present invention) to at least one target nucleic acid v_(a) and wherein the mean of the signal intensities of the probe(s) which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), thereby indicating the presence of v_(a). In particular, the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a). More in particular, the presence of a target nucleic acid in a biological sample is given by a value of t-test ≦0.1 and/or Anderson-Darling test value ≦0.05 and/or a value of Weighted Kullback-Leibler divergence of ≧1.0, preferably ≧5.0. For example, the t-test value may be ≦0.05.

According to another aspect, the present invention provides a method of detecting at least one target nucleic acid, comprising the steps of:

-   (i) providing at least one biological sample;
-   (ii) amplifying at least one nucleic acid(s) comprised in the biological sample;
-   (iii) providing at least one oligonucleotide capable of hybridizing to at least one target nucleic acid, if present in the biological sample; and
-   (iv) contacting the oligonucleotide(s) with the amplified nucleic acids and detecting the oligonucleotide(s) hybridized to the target nucleic acid(s), wherein the mean of the signal intensities of the oligonucleotide(s) which hybridize to v_(a) is statistically higher than the mean of the oligonucleotide(s) ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.

In particular, the oligonucleotide is an oligonucleotide probe.

In step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample. In particular, in step (iv) the presence of at least one target nucleic acid in a biological sample is given by a value of t-tests ≦0.1 and/or Anderson-Darling test value ≦0.05 and/or a value of Weighted Kullback-Leibler divergence of ≧1.0, preferably ≧5.0. The t-test value may be ≦0.05. The nucleic acid to be detected is nucleic acid exogenous to the nucleic acid of the biological sample. The target nucleic acid to be detected may be at least one pathogen genome or fragment thereof. The pathogen nucleic acid may be at least one nucleic acid from a virus, a parasite, or bacterium, or a fragment thereof. In particular, when the sample is obtained from a human being, the target nucleic acid, if present in the biological sample, is not from the human genome. The probes may be placed on an insoluble support. The support may be a microarray, a biochip, or a membrane/synthetic surface.

The present invention also provides an apparatus for performing the methods according to the invention. In particular, the apparatus may be for designing oligonucleotide(s) for nucleic acid detection and/or amplification, the apparatus being configured to identify and/or select at least one region(s) of at least one target nucleic acid to be amplified, the region(s) having an efficiency of amplification (AE) higher than the average AE; and design at least one oligonucleotide(s) capable of hybridizing to the identified and/or selected region(s). More in particular, the apparatus may be configured to detect at least one target nucleic acid comprising any one of the steps of: providing at least one biological sample; amplifying nucleic acid(s) comprised in the biological sample; providing at least one oligonucleotide capable of hybridizing to at least one target nucleic acid, if present in the biological sample, wherein the oligonucleotide(s) is designed and/or prepared according to the apparatus being configured according to the invention; and contacting the oligonucleotide(s) with the amplified nucleic acids and/or detecting the oligonucleotide(s) hybridized to the target nucleic acid(s).

The present invention also provides at least one computer program product configured for performing the method according to the invention. There is also provided at least one electronic storage medium storing the configuration of the apparatus according to the invention. According to one aspect, the invention provides a removable electronic storage medium comprising software configured to perform the method(s) according to the invention. In particular, the removable electronic storage medium may comprise software configured to determine the WKL divergence score and/or Anderson-Darling test for designing at least one oligonucleotide probe and/or primer, and/or detecting at least one target nucleic acid. More in particular, the removable electronic storage medium comprising a software configuration may comprise the WKL, Anderson-Darling test, the designing of probe(s) and/or the detecting of target nucleic acid(s) as defined according to the invention. Accordingly, there is also provided software configured as described above.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows a RT-PCR binding process of a pair of random primers on a virus sequence (SEQ ID NOS: 1 to 9). The labels for FIG. 1 are as follows:

A: Reverse transcription (RT). Primer binds to template.
B: Tagged RT products are generated (in detail with hypothetical viral sequence template, and hypothetical specific random primer).
C: Second strand synthesis is completed incorporating tags.
D: Amplification of tagged RT product using PCR Primer GTTTCCCAGTCACGATA (SEQ ID NO:8).

FIG. 2 shows an Amplification Efficiency Scoring (AES) Map for the RSV B genome.

FIG. 3 shows oligonucleotide probe signal intensities for 1 experiment for RSV B.

FIGS. 4(A, B). FIG. 4A shows the density distribution of signal intensities of a virus that is in the sample tested. An arrow indicates the positive skewness of the distribution. This indicates that although there is noise, there is a significant amount of real signal as well. FIG. 4B shows the density distribution of signal intensities of a virus not in the sample. It is noise dominant.

FIG. 5 shows an analysis framework of pathogen detection chip data.

FIG. 6. Oligonucleotide probe design schema. This illustrates the tiling probes created across the genome of NC_001781 Human respiratory syncytial virus (RSV). The numbers represent the start and end positions of each probe. 1948 probes were synthesized to cover the entire 15225 bp RSV genome. This process was repeated for the remaining 34 viral genomes.

FIG. 7(A,B,C) Key to labels of microarray bars:

Virus family: Virus genus/species
Orthomyxoviridae: Sars Sin2500, OC41, 229E
Coronaviridae: Flu A, Flu B
Picornaviridae: Entero D, Entero C, Echo 1, Entero B, Entero A, Rhino 89, Rhino B, Hep A, Foot & mouth C
Bunyaviridae: Hantaan, Sin Nombre
Flaviviridae: West Nile, Jap enceph, Dengue 3, Dengue 1, Dengue 2, Dengue 4, Yellow fever
Paramyxoviridae: Paraflu 1, Paraflu 3, Nipah, Paraflu 2, Newcastle, RSV (B1), Metapneumovirus
Others: HPV type 10, HIV 1, Hep B, Rubella, LCMV-S, LCMV-L, PMMV
Human controls

RNA isolated from SARS Sin850-infected cell line (A) or Dengue I-infected cell line (B) was hybridized onto the pathogen microarray following SARS-specific or Dengue I-specific RT-PCR respectively. SARS cross-hybridized (shown in black colour) to other coronaviridae genomes, particularly to the highly conserved middle portion of the genome (Ruan et al. 2003). Dengue I cross-hybridized to probes derived from flaviviridae and other genomes based on their sequence similarity. By examining the Hamming Distance (HD) and Maximum Contiguous Match (MCM) scores, we established thresholds to predict whether cross-hybridization would occur and utilized this information to generate in silico hybridization signatures. (C) RNA isolated from a clinical patient diagnosed with RSV was amplified using random RT-PCR and hybridized onto the pathogen microarray.

FIG. 8 Relationship between probe Hamming Distance (HD), probe Maximum Contiguous Match (MCM) and probe Signal Intensity. Average probe signal intensity decreases as HD increases and MCM decreases. This correlates with a reduction in the percentage of detectable probes (signal intensity>mean+2 SD). At the optimal cross-hybridization thresholds HD≦4 or MCM≧18 (shaded), >98% of probes can be detected. At HD=5 or MCM=17, the detection rate falls to 85%.
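The detectability criterion (signal intensity above mean plus 2 SD) may be illustrated with the short Python sketch below. The function name, the use of a separate reference set to define the mean and standard deviation, and the use of the sample (n − 1) standard deviation are all assumptions of this sketch, since the caption does not state over which probes the mean and SD are taken.

```python
from statistics import mean, stdev


def detectable_fraction(signals, reference):
    """Fraction of signals above mean(reference) + 2 * stdev(reference)."""
    threshold = mean(reference) + 2 * stdev(reference)
    return sum(s > threshold for s in signals) / len(signals)


# Hypothetical intensities: four probes scored against a reference set.
reference_set = [1, 2, 3, 4, 5]
probe_signals = [7, 8, 5, 10]
print(detectable_fraction(probe_signals, reference_set))
```

Grouping probes by their HD or MCM value and applying this function to each group would reproduce the kind of detection-rate comparison described in the caption.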

FIG. 9(A, B) RNA isolated from a RSV-infected patient was hybridized onto a pathogen detection array. (A) The distribution of probe signal intensities of all 53,555 probes shows a normal distribution (grey solid line). Non-RSV probes, when examined on a genome-specific level, e.g. parainfluenza-1 (grey dotted line), also show a normal distribution. The signal intensities of RSV-specific probes have a positive skew, with higher signal intensities in the tail of the distribution (black solid line). (B) Distribution frequency of WKL scores for the 35 SPS, with the majority ranging between −5 and 3. However, the WKL score for the RSV genome is 17, so the distribution is not normal (P<0.05 by Anderson-Darling test). Excluding the outlier genome results in a normal distribution. From this computation, we conclude that RSV is present in the hybridized sample.

FIG. 10 AES is indicative of probe amplification efficiency. A higher proportion of probes with high AES is detectable above signal intensity thresholds over 5 experiments.

FIG. 11 Schema showing the processes necessary for pathogen detection using a microarray.

FIG. 12 Hybridization signal intensity correlates to Amplification Efficiency Score (AES), P=2.2×10⁻¹⁶. A RSV patient sample was hybridized onto a microarray, and signal intensities of each probe were plotted together with the computed AES. The signal threshold for high-confidence detection on a typical array is indicated by the green line.

FIG. 13 Using AES-optimized primer tags for random RT-PCR increases the AES by 10-30-fold. The optimized primers were predicted to have the same performance across all 35 genomes represented on the microarray. Most patient samples were amplified using the AES-optimized primer A2.

SEQ ID NO:    Primer    Nucleotide sequence
10            A1        GTTTCCCAGTCACGATA
11            A2        GATGAGGGAAGATGGGG
12            A3        CTCATGCACGACCCAAA
13            A4        AGATCCATTCCACCCCA

FIG. 14(A,B) Key to labels of microarray bars:

Virus family: Virus genus/species
Orthomyxoviridae: Sars Sin2500, OC41, 229E
Coronaviridae: Flu A, Flu B
Picornaviridae: Entero D, Entero C, Echo 1, Entero B, Entero A, Rhino 89, Rhino B, Hep A, Foot & mouth C
Bunyaviridae: Hantaan, Sin Nombre
Flaviviridae: West Nile, Jap enceph, Dengue 3, Dengue 1, Dengue 2, Dengue 4, Yellow fever
Paramyxoviridae: Paraflu 1, Paraflu 3, Nipah, Paraflu 2, Newcastle, RSV (B1), Metapneumovirus
Others: HPV type 10, HIV 1, Hep B, Rubella, LCMV-S, LCMV-L, PMMV
Human controls

Choice of primer tag in random RT-PCR has a significant effect on PCR efficiency. Heatmap showing probes hybridizing to a clinical hMPV sample following RT-PCR using (A) the original primer described by Bohlander et al. 1992, or (B) a primer designed following PCR modeling to ensure that it will efficiently amplify all genomes (high AES) represented on the microarray.

FIG. 15 Diagnostic PCR results for RSV Patient #412 confirm that the patient does not have a coronavirus infection. (A) PCR using Pancoronavirus primers. Lane 1: OC43 coronavirus positive control, Lane 2: 229E coronavirus positive control, Lane 3: RSV patient #412, Lane 4: PCR primers and reagents only negative control. 1 kb ladder. (B) PCR using OC43 specific primers. Lane 1: OC43 coronavirus positive control, Lane 2: RSV patient #412, Lane 3: purified RSV from ATCC, Lane 4: PCR negative control. 50 bp ladder. (C) PCR using 229E specific primers. Lane 1: 229E coronavirus positive control, Lane 2: RSV patient #412, Lane 3: PCR negative control. 1 kb ladder.

DETAILED DESCRIPTION OF THE INVENTION

Bibliographic references mentioned in the present specification are, for convenience, listed in the form of a list of references and added to the end of the examples. The whole content of such bibliographic references is herein incorporated by reference.

The present invention addresses the problems of the prior art, and in particular provides at least one method, apparatus and/or product of oligonucleotide design. In particular, there is provided a method, apparatus and/or product of probe and/or primer design. There is also provided a method, apparatus and/or product of nucleic acid(s) detection.

While the concept of using oligonucleotide hybridization microarrays as a tool for determining the presence of pathogens has been proposed, significant hurdles remain, thus preventing the use of these microarrays routinely (Striebel, H. M., 2003). These hurdles include probe design and data analysis (Striebel, H. M., 2003; Bodrossy, L. & Sessitsch, A., 2004; Vora, G. J., et al., 2004). The present inventors observed in a pilot microarray that despite meticulous probe selection, the best in silico designed probes do not necessarily hybridize well to patient samples. The inventors realized that to generate probes which would hybridize consistently well to patient material, it was necessary to develop a new and/or improved method of probe design so as to determine the optimal design predictors. In particular, as described in the Example section, the present inventors created a microarray comprising overlapping 40-mer probes, tiled across 35 viral genomes. However, the invention is not limited to this particular application, probe length and type of target nucleic acid.
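By way of a hedged illustration of tiling overlapping fixed-length probes across a genome, a short Python sketch follows. The probe length of 40 follows the 40-mer probes mentioned above; the function name, the fixed `step` between probe start positions, and the toy sequence are hypothetical, and no claim is made that this spacing reproduces the probe counts reported for any particular genome.

```python
def tile_probes(genome, probe_len=40, step=8):
    """Return (start, end, sequence) for probes tiled along the genome.

    Positions are 1-based and inclusive. Consecutive probes overlap by
    probe_len - step bases whenever step < probe_len.
    """
    last_start = len(genome) - probe_len
    return [(s + 1, s + probe_len, genome[s:s + probe_len])
            for s in range(0, last_start + 1, step)]


# Toy 100-base sequence, tiled with 40-mers every 20 bases.
toy_genome = "ACGT" * 25
for start, end, sequence in tile_probes(toy_genome, probe_len=40, step=20):
    print(start, end, sequence)
```

With a fixed step, any bases beyond the last full-length window are left uncovered; a final probe anchored at the end of the sequence could be appended if complete coverage is required.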

According to a particular aspect of the invention, the present inventors describe how a support, in particular a microarray platform, is optimized so as to become a viable tool in target nucleic acid detection, in particular, in pathogen detection. The inventors also identified probe design predictors, including melting temperature, GC-content of the probe, secondary structure, hamming distance, similarity to human genome, effect of PCR primer tag in random PCR amplification efficiency, and/or the effect of sequence polymorphism. These results were considered and/or incorporated into the development of a method and criteria for probe and/or primer design. According to a more particular aspect, the inventors developed a data analysis algorithm which may accurately predict the presence of a target nucleic acid, which may or may not be a pathogen. For example, the pathogen may be, but is not limited to, a virus, bacteria and/or parasite(s). The algorithm may be used even if probes are not ideally designed. This detection algorithm, coupled with a probe design methodology, significantly improves the confidence level of the prediction (see Tables 6 and 7).

According to a particular aspect, the method of the invention may not require a prediction of the likely pathogen, but may be capable of detecting most known human viruses, bacteria and/or parasite(s), as well as some novel species, in an unbiased manner. Genome or a fragment thereof is defined as all the genetic material in the chromosomes of an organism. DNA derived from the genetic material in the chromosomes of a particular organism is genomic DNA. A genomic library is a collection of clones made from a set of randomly generated overlapping DNA fragments representing the entire genome of an organism. The rationale behind this detection platform according to the invention is that each species of virus, bacteria and/or parasite(s) contains unique molecular signatures within the primary sequence of their genomes. Identification of these distinguishing regions allows for rational oligonucleotide probe design for the specific detection of individual species, and in some cases, individual strains. The concomitant design and/or preparation of oligonucleotide (oligo) probes that represent the most highly conserved regions among family and genus members will enable the detection and partial characterization of some novel pathogens. Furthermore, the inclusion of all such probes in a single support may allow the detection of multiple viruses, bacteria and/or parasite(s) that simultaneously co-infect a clinical sample. The support may be an insoluble support, in particular a solid support, for example a microarray or a biochip assay.

According to a particular aspect, the invention may be used as a diagnostic tool, depending on the way in which oligonucleotide probes are designed, and/or how the data generated by the microarray is interpreted and analyzed.

Determination of Efficiency of Amplification

According to a first aspect, the present invention provides a method of designing oligonucleotide probe(s) for nucleic acid detection comprising the following steps in any order:

-   -   (i) identifying and/or selecting at least one region of at least one target nucleic acid to be amplified, the region(s) having an efficiency of amplification (AE) higher than the average AE; and
    -   (ii) designing at least one oligonucleotide probe capable of hybridizing to the identified and/or selected region(s).

In particular, in step (i) a score of AE is determined for every position i along the length of the target nucleic acid, or of a region thereof, and an average AE is obtained. Those regions showing an AE higher than the average are selected as the region(s) of the target nucleic acid to be amplified. In particular, the AE of the selected region(s) may be calculated as the Amplification Efficiency Score (AES), which is the probability that a forward primer r_(i) can bind to a position i and a reverse primer r_(j) can bind at a position j of the target nucleic acid, where |i−j| is the length of the region of the target nucleic acid desired to be amplified. In particular, the region |i−j| may be ≦10000 bp, more in particular ≦5000 bp, or ≦1000 bp, for example ≦500 bp. In particular, the forward and/or reverse primers may be random primers.
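The selection rule in step (i) can be illustrated with a minimal sketch. It assumes that a per-position score list is already available; the function name `above_average_regions` and the choice to report maximal contiguous runs of above-average positions are illustrative assumptions, not part of the described method.

```python
from statistics import mean


def above_average_regions(scores):
    """Return (start, end) pairs, end exclusive, of maximal runs of
    positions whose score is strictly greater than the mean score."""
    threshold = mean(scores)  # the "average AE" over all positions
    regions = []
    start = None
    for pos, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = pos  # a new above-average run begins here
        elif start is not None:
            regions.append((start, pos))  # the run ended at the previous position
            start = None
    if start is not None:
        regions.append((start, len(scores)))  # run extends to the last position
    return regions
```

Under these assumptions, positions are 0-based and a region is reported as a half-open interval, so that `scores[start:end]` returns the scores of the selected region.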

According to another aspect, the step (i) of identifying and/or selecting region(s) of a target nucleic acid to be amplified comprises determining the effect of geometrical amplification bias for every position of a target nucleic acid, and selecting the region(s) to be amplified as the region(s) having an efficiency of amplification (AE) higher than the average AE. The geometrical amplification bias may be defined as the capability of some regions of a nucleic acid to be amplified more efficiently than other regions. For example, the geometrical amplification bias is the PCR bias.

Modeling of Amplification Efficiency

Since it is not known what target nucleic acid (for example a pathogen) exists within the patient sample, random primers may be used during the amplification step and/or the reverse-transcription (RT) process to ensure unbiased reverse-transcription of all RNA present into DNA. Any random amplification method known in the art may be used for the purposes of the present invention. In the present description, the random amplification method may be RT-PCR.

However, it will be clear to a skilled person that the method of the present invention is not limited to RT-PCR. In particular, the RT-PCR approach may be susceptible to signal inaccuracies caused by primer-dimer bindings and poor amplification efficiencies in the RT-PCR process (Bustin, S. A., et al., 2004). To overcome this hurdle, the inventors have modeled the RT-PCR process by using random primers.

According to a particular aspect of the invention, the amplification step comprises forward and reverse primers, and each of the forward and reverse primers comprises, in a 5′-3′ orientation, a fixed primer header and a variable primer tail, and wherein at least the variable tail hybridizes to a portion of the target nucleic acid v_(a). The size of the fixed primer header and that of the variable primer tail may be of any size, in mer, suitable for the purposes of the method according to the present invention. The fixed header may be 10-30 mer, in particular, 15-25 mer, for example 17 mer. The variable tail may be 1-20 mer, in particular, 5-15 mer, for example 9 mer. An example of these forward and reverse primers is shown in FIG. 1. More in particular, the amplification step may comprise forward and/or reverse random primers having the nucleotide sequence 5′-GTTTCCCAGTCACGATANNNNNNNNN-3′ (SEQ ID NO:1), wherein N is any one of A, T, C, and G or a derivative thereof.

According to a particular embodiment, also exemplified in FIG. 1, the present inventors have modeled the random RT-PCR process as follows. Let v_(a) be the actual virus in the sample. The random primer used in the RT-PCR process was preferably a 26-mer primer having a fixed 17-mer header and a variable 9-mer tail of the form (5′-GTTTCCCAGTCACGATANNNNNNNNN-3′) (SEQ ID NO:1 and, in particular, SEQ ID NOS:2-7). However, it is clear to a skilled person that the primer according to the invention is not limited to the sequence(s) of SEQ ID NOS:1-7 and FIG. 1. In fact, the nucleotide size of the primer, and in particular of the header and variable tail, may be varied and chosen within the ranges discussed above. To obtain an RT-PCR product in a region between positions i and j of v_(a), the inventors required (1) a forward primer binding to position i, (2) |i−j|≦10000, and (3) a reverse primer binding to position j. In particular, |i−j|, which is the length of the region of the target nucleic acid desired to be amplified, may be ≦5000 bp, more in particular ≦1000 bp, for example ≦500 bp. The quality of the RT-PCR product depends on how well the forward primer and/or the reverse primer bind to v_(a). Some random primers can bind to v_(a) better than others. The identification of such primers and where they bind to v_(a) gives an indication of how likely a particular region of v_(a) will be amplified. Using this approach, there is provided an amplification efficiency model which computes an Amplification Efficiency Score (AES) for every position of v_(a).

For a particular position i of a target nucleic acid v_(a), P^(f) (i) and P^(r) (i) are the probabilities that a random primer r_(i) can bind to position i of v_(a) as forward primer and reverse primer respectively. For simplicity, it is assumed that a random primer can only bind to v_(a) if the last 9 nucleotides of the random primer are a substring of the reverse complement of v_(a) (forward primer) or a substring of v_(a) (reverse primer). This is shown in FIG. 1. Based on well-established primer design criteria (Wu, D. Y., et al., 1991), P^(f) (i) was estimated to be low if r_(i) forms a significant primer-dimer or has an extreme melting temperature. On the other hand, if r_(i) does not form any significant primer-dimer and has an optimal melting temperature, then P^(f) (i) will be high. Note that if the header of the random primer is similar to v_(a), it may also aid in the binding and thus result in a higher P^(f) (i). P^(r) (i) was computed similarly.
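The simplifying binding assumption above (the primer tail must occur as an exact substring of the target or of its reverse complement) can be sketched as follows. The function names are hypothetical, the sketch handles only unambiguous A/C/G/T sequences, and it does not estimate P^(f) or P^(r); it only locates candidate binding sites.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(haystack, needle):
    """Start indices of all (possibly overlapping) occurrences of needle."""
    hits = []
    idx = haystack.find(needle)
    while idx != -1:
        hits.append(idx)
        idx = haystack.find(needle, idx + 1)
    return hits


def tail_binding_sites(target, tail):
    """Return (forward_sites, reverse_sites) for one primer tail.

    forward_sites: indices into the reverse complement of the target
                   (the tail is a substring of the reverse complement).
    reverse_sites: indices into the target itself
                   (the tail is a substring of the target).
    """
    forward_sites = find_all(reverse_complement(target), tail)
    reverse_sites = find_all(target, tail)
    return forward_sites, reverse_sites
```

For the 9-mer tail of the described primer, `tail` would be one concrete 9-nucleotide instance of the random portion; shorter tails are used in toy examples only.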

The binding of the random primer r_(i) at position i of v_(a) as a forward primer affects the quality of the RT-PCR product for at least 10000 nucleotides upstream of position i. Similarly, the binding of the random primer r_(i) at position i of v_(a) as a reverse primer affects the quality of the RT-PCR product for at least 10000 nucleotides downstream of position i. Thus, an amplification efficiency score, AES_(i), for every position i of v_(a) can be computed by considering the combined effect of all forward and reverse primer pairs that amplify it:

$$AES_{i} = \sum_{j=i-Z}^{i} \left\{ P^{f}(j) \cdot \sum_{k=\max(i+1,\,j+500)}^{j+Z} P^{r}(k) \right\},$$

wherein

$$\sum_{k=\max(i+1,\,j+500)}^{j+Z} P^{r}(k) = P^{r}(i+1) + P^{r}(i+2) + \ldots + P^{r}(j+Z)$$

-   -   P^(f) (i) and P^(r) (i) are the probabilities that a random primer r_(i) can bind to position i of v_(a) as forward primer and reverse primer, respectively, and Z≦10000 bp is the region of v_(a) desired to be amplified.
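The double sum above can be evaluated directly once P^(f) and P^(r) are available for every position. The following is a minimal sketch, not the inventors' implementation: the function name, the 0-based indexing, the clamping of summation bounds at the ends of the sequence, and the default values of `z` and `min_len` are assumptions made for illustration.

```python
from itertools import accumulate


def amplification_efficiency_scores(p_forward, p_reverse, z=1000, min_len=500):
    """AES_i = sum_{j=i-z}^{i} P_f(j) * sum_{k=max(i+1, j+min_len)}^{j+z} P_r(k).

    p_forward[j] and p_reverse[k] are per-position binding probabilities.
    Indices falling outside the sequence are ignored (an assumption).
    """
    if len(p_forward) != len(p_reverse):
        raise ValueError("p_forward and p_reverse must have equal length")
    n = len(p_forward)
    # prefix[t] holds sum(p_reverse[:t]); a sum over k in [a, b] is prefix[b + 1] - prefix[a]
    prefix = [0.0] + list(accumulate(p_reverse))
    scores = [0.0] * n
    for i in range(n):
        total = 0.0
        for j in range(max(0, i - z), i + 1):
            lo = max(i + 1, j + min_len)  # first admissible reverse-primer position
            hi = min(j + z, n - 1)        # last admissible reverse-primer position
            if lo <= hi:
                total += p_forward[j] * (prefix[hi + 1] - prefix[lo])
        scores[i] = total
    return scores
```

The prefix sums avoid recomputing the inner sum for every pair (i, j), so the cost is proportional to the sequence length times `z` rather than times `z` squared.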

Accordingly, Z may be ≦10000 bp, ≦5000 bp, ≦1000 bp or ≦500 bp. To verify whether the variation in signal intensities displayed by different regions of a virus correlates directly with their corresponding amplification efficiency scores, several microarray experiments (in the particular case, a total of five microarray experiments) were performed on a common pathogen affecting humans, the human respiratory syncytial virus B (RSV B).

Modeling of RT-PCR for Amplification Efficiency

According to the method of the invention, which is an improvement of the method of Sung et al. (2003, CSB), the primer used for the reverse transcription comprises a fixed oligonucleotide tag (header) and a random oligonucleotide tail. In theory, the random oligonucleotide tail should bind indiscriminately to all nucleic acids in the patient sample, initiating first strand synthesis. After the second strand synthesis, all reverse-transcribed sequences will have the fixed oligonucleotide tag (header) at both ends. These sequences are amplified by PCR, using the fixed oligonucleotide tag (header) as the primer to generate PCR products of at least 10000 bp in length. In particular, the majority of the amplified PCR products are between 500-1000 bp in length. According to the particular embodiment, the 26-mer primer used for reverse transcription (RT) comprises a fixed 17-mer tag with a 9-mer random tail: 5′-GTTTCCCAGTCACGATANNNNNNNNN-3′ (SEQ ID NO:1).

In our model, v_(a) represents the pathogen in the clinical sample. Generating at least one PCR product, for example of 500-1000 bp, in any region of the genome defined by positions i and j of v_(a) requires a forward primer binding to position i and a reverse primer binding to position j in the anti-sense direction such that 500≦|i−j|≦10000, and in particular, such that 500≦|i−j|≦1000. The binding affinity of a primer is determined by at least two factors: (1) primer-dimer formation, and (2) hybridization affinity of the primer to the virus v_(a). Genomic regions which can be successfully amplified by virtue of having ideal primer binding locations within 10000 nucleotides, in particular within 1000 or 500 nucleotides, can be predicted by calculating an Amplification Efficiency Score (AES) for every position of v_(a) (FIG. 1).

Amplification Efficiency Score (AES)

For each position i of v_(a), let P^(f) (i) and P^(r) (i) be the probabilities that a random primer r_(i) can bind to position i of v_(a) as forward primer and reverse primer respectively. For simplicity, we assumed that a random primer can only bind to v_(a) if the nucleotides of the random tail of the primer (for example, the last 9 nucleotides of the random primer as shown in FIG. 1) are a substring of the reverse complement of v_(a) (forward primer) or a substring of v_(a) (reverse primer; FIG. 1). Based on well-established primer design criteria (Wu and Ugozzoli, 1991), we estimated P^(f) (i) to be low if r_(i) formed a significant primer-dimer or had an extreme melting temperature. On the other hand, if r_(i) did not form any significant primer-dimer and had an optimal melting temperature, then P^(f) (i) would be high. If the fixed oligonucleotide tag (header) of the random primer (for example, the fixed 17-mer tag shown in FIG. 1) is similar to v_(a), it may also aid in the binding and thus result in a higher P^(f) (i). We computed P^(r) (i) similarly.

The binding of the random primer r_(i) at position i of v_(a) as a forward primer affects the quality of the RT-PCR product for the nucleotides upstream of position i (for example, for the 500 to 1000 nucleotides upstream of position i). Similarly, the binding of the random primer r_(i) at position i of v_(a) as a reverse primer affects the quality and coverage of the RT-PCR product for the nucleotides downstream of position i (for example, for the 500 to 1000 nucleotides downstream of position i). Consider a position x of v_(a). All effective primer pairs that reside at positions i and j respectively contribute to the quality of the RT-PCR product at x. Note that i≦x≦j and |i−j|≦10000. For example, 500≦|i−j|≦1000, since our RT-PCR products were 500 to 1000 base pairs long. Thus, an Amplification Efficiency Score, AES_(x), for every position x of v_(a) can be computed by considering the combined effect of all primer pairs that amplify it:

$$AES_{x} = \sum_{j=x-1000}^{x} \left\{ P^{f}(j) \cdot \sum_{k=\max(x+1,\,j+500)}^{j+1000} P^{r}(k) \right\}$$

AES Threshold Predictive of Successful RT-PCR

The threshold for amplification efficiency scores for probe selection for a virus v_(a) is determined by the cumulative distribution function of the AES values of v_(a). Let X be the random variable representing the AES values of all probes of v_(a). Let k be the number of probes in v_(a). Then, we denote the probability that the AES value is less than or equal to x by P(X≦x)=c/k, where c is the number of probes which have AES values less than or equal to x. For a probe p_(i) at position i of v_(a), let x_(i) be its corresponding AES value. Since the signal intensity of a probe is highly correlated to its AES value, we estimated P(p_(i)|v_(a)), the probability that p_(i) has high signal intensity in the presence of v_(a), to be P(X≦x_(i)). Thus,

$$P(p_{i} \mid v_{a}) \approx P(X \leq x_{i}) = \frac{c_{i}}{k}$$

where c_(i) is the number of probes whose AES values are less than or equal to x_(i).

For probe selection, probe p_(i) is selected if P(p_(i)|v_(a))>λ. In our experiments, we set λ=0.8. At this threshold (top 20% AES), we observed that more than 50% of expected probes would hybridize reproducibly to different clinical samples. While using probes with higher AES (e.g., top 10% AES) would improve reproducibility, this would reduce the number of unique probes remaining for some genomes to <10 at the species level, consequently eroding the ability of the array to specifically identify the pathogen. Thus, the top 20% AES threshold was used.
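The empirical cumulative distribution rule above can be sketched in a few lines. The function name and the representation of probes by their list index are assumptions for illustration; only the rule c_(i)/k>λ is taken from the text.

```python
from bisect import bisect_right


def select_probes_by_aes(aes_values, lam=0.8):
    """Return indices of probes whose empirical CDF value c_i / k exceeds lam.

    aes_values[i] is the AES value x_i of probe p_i; k is the number of probes.
    """
    ordered = sorted(aes_values)
    k = len(ordered)
    selected = []
    for index, x in enumerate(aes_values):
        c = bisect_right(ordered, x)  # number of probes with AES <= x
        if c / k > lam:
            selected.append(index)
    return selected
```

With ten probes of distinct AES values and `lam=0.8`, only the two highest-scoring probes satisfy c_(i)/k>0.8, which corresponds to the top 20% mentioned in the text.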

Empirical Determination of Cross-Hybridization Thresholds on a Pathogen Detection Microarray: Probe Design

The step (ii) of designing oligonucleotide probe(s) capable of hybridizing to the selected region(s) may be carried out according to any one of the probe designing techniques known in the art. The following description relates to probe design; however, it will be clear to a skilled person how to apply the same principle to designing primer(s), in particular, primer(s) for RT-PCR.

For example, given a set of target nucleic acids (for example, viral genomes) V={v₁, v₂, . . . , v_(n)}, for every v_(i)∈V, a set of length-m probes (that is, substrings of v_(i)) may be designed taking into consideration, for example, at least one of the following conditions:

-   -   (a) established probe design criteria of homogeneity, sensitivity and specificity (Sung, W. K. et al., 2003, CSB);
    -   (b) no significant sequence similarity to human genome; and
    -   (c) efficiently amplified using AE score, for example by RT-PCR, as herein described.

Noisy signals caused by cross-hybridization artifacts present a major obstacle to the interpretation of microarray data, particularly for the identification of rare pathogen sequences present in a complex mixture of nucleic acids. For example, in clinical specimens, contaminating nucleic acid sequences, such as those derived from the host tissue, will cross-hybridize with pathogen-specific microarray probes above some threshold of sequence complementarity. This can result in false-positive signals leading to erroneous conclusions. Similarly, the pathogen sequence, in addition to binding its specific probes, may cross-hybridize with other non-target probes (i.e., designed to detect other pathogens). This latter phenomenon, though seemingly problematic, could provide useful information for pathogen identification to the extent that such cross-hybridization may be accurately predicted. With various metrics to assess annealing potential and sequence specificity, microarray probes have traditionally been designed to ensure maximal specific hybridization (to a known target) with minimal cross-hybridization (to non-specific sequences). However, in practice we have found that many probes, though designed using optimal in silico parameters, do not perform according to expectations for reasons that are unclear.

To systematically investigate the dynamics of array-based pathogen detection, we created an oligonucleotide array using Nimblegen array synthesis technology (Nuwaysir et al. 2002). The array was designed to detect up to 35 RNA viruses using 40-mer probes tiled at an average 8-base resolution across the full length of each genome (53,555 probes; FIG. 6, Table 1).
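Tiling of a genome with overlapping probes can be sketched as below. The text states an average 8-base resolution; the fixed step of 8 used here, the function name and the 0-based start coordinates are simplifying assumptions.

```python
def tile_probes(genome, probe_len=40, step=8):
    """Return (start, probe) pairs for probes of length probe_len,
    taken every `step` bases along the genome (0-based start positions).
    Only full-length probes are returned."""
    last_start = len(genome) - probe_len
    return [
        (start, genome[start:start + probe_len])
        for start in range(0, last_start + 1, step)
    ]
```

A sequence shorter than `probe_len` yields no probes, and any trailing bases that do not fit a whole probe at the chosen step are left uncovered in this sketch.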

TABLE 1 List of genomes represented on the pathogen detection microarray. (column 1) Number of probes for each genome synthesized on the microarray. (column 2) Number of probes for each genome remaining following application of probe design filters. (column 3) Number of probes for each genome which are unique to the genome and do not cross-hybridize with human.

| Genome | Original no. of probes (1) | Filtered no. of probes (2) | Unique probes (3) | NCBI GI number | Ref type | Accession no. | Description |
|---|---|---|---|---|---|---|---|
| 1 | 1948 | 537 | 271 | 9629198 | RefSeq | NC_001781.1 | Human respiratory syncytial virus, complete genome |
| 2 | 1995 | 550 | 295 | 19718363 | RefSeq | NC_003461.1 | Human parainfluenza virus 1 strain Washington/1964, complete genome |
| 3 | 2002 | 762 | 474 | 19525721 | RefSeq | NC_003443.1 | Human parainfluenza virus 2, complete genome |
| 4 | 1979 | 701 | 345 | 10937870 | RefSeq | NC_001796.2 | Human parainfluenza virus 3, complete genome |
| 5 | 3805 | 588 | 444 | 30468042 | Genbank | AY283794.1 | SARS coronavirus Sin2500, complete genome |
| 6 | 3937 | 604 | 356 | 38018022 | RefSeq | NC_005147.1 | Human coronavirus OC43, complete genome |
| 7 | 3495 | 182 | 112 | 12175745 | RefSeq | NC_002645.1 | Human coronavirus 229E, complete genome |
| 8 | 1705 | 292 | 177 | 46852132 | RefSeq | NC_004148.2 | Human metapneumovirus, complete genome |
| 9 | 296 | 118 | 101 | 8486138 | RefSeq | NC_002023.1 | Influenza A virus RNA segment 1, complete sequence |
| 10 | 282 | 69 | 42 | 8486136 | RefSeq | NC_002022.1 | Influenza A virus RNA segment 3, complete sequence |
| 10 | 296 | 81 | 54 | 8486134 | RefSeq | NC_002021.1 | Influenza A virus RNA segment 2, complete sequence |
| 10 | 110 | 69 | 57 | 8486131 | RefSeq | NC_002020.1 | Influenza A virus RNA segment 8, complete sequence |
| 10 | 196 | 71 | 62 | 8486129 | RefSeq | NC_002019.1 | Influenza A virus RNA segment 5, complete sequence |
| 10 | 177 | 75 | 59 | 8486127 | RefSeq | NC_002018.1 | Influenza A virus RNA segment 6, complete sequence |
| 10 | 225 | 70 | 51 | 8486125 | RefSeq | NC_002017.1 | Influenza A virus RNA segment 4, complete sequence |
| 10 | 300 | 105 | 48 | 8486164 | RefSeq | NC_002204.1 | Influenza B virus RNA-1, complete sequence |
| 10 | 293 | 113 | 74 | 8486148 | RefSeq | NC_002205.1 | Influenza B virus RNA-2, complete sequence |
| 10 | 279 | 94 | 59 | 8486150 | RefSeq | NC_002206.1 | Influenza B virus RNA-3, complete sequence |
| 10 | 237 | 70 | 53 | 8486152 | RefSeq | NC_002207.1 | Influenza B virus RNA-4, complete sequence |
| 10 | 232 | 90 | 82 | 8486154 | RefSeq | NC_002208.1 | Influenza B virus RNA-5, complete sequence |
| 10 | 195 | 64 | 32 | 8486156 | RefSeq | NC_002209.1 | Influenza B virus RNA-6, complete sequence |
| 10 | 150 | 47 | 37 | 8486159 | RefSeq | NC_002210.1 | Influenza B virus RNA-7, complete sequence |
| 10 | 136 | 59 | 50 | 8486161 | RefSeq | NC_002211.1 | Influenza B virus RNA-8, complete sequence |
| 11 | 1401 | 85 | 54 | 11528013 | RefSeq | NC_001563.2 | West Nile virus, complete genome |
| 12 | 1389 | 145 | 123 | 9627244 | RefSeq | NC_002031.1 | Yellow fever virus, complete genome |
| 13 | 2335 | 235 | 171 | 13559808 | RefSeq | NC_002728.1 | Nipah virus, complete genome |
| 14 | 1943 | 244 | 211 | 11545722 | RefSeq | NC_002617.1 | Newcastle disease virus, complete genome |
| 15 | 1174 | 208 | 128 | 9629357 | RefSeq | NC_001802.1 | Human immunodeficiency virus 1, complete genome |
| 16 | 409 | 134 | 106 | 21326584 | RefSeq | NC_003977.1 | Hepatitis B virus, complete genome |
| 17 | 1011 | 169 | 135 | 9627257 | RefSeq | NC_001576.1 | Human papillomavirus type 10, complete genome |
| 18 | 1036 | 325 | 299 | 10445391 | RefSeq | NC_002554.1 | Foot-and-mouth disease virus C, complete genome |
| 19 | 1246 | 211 | 209 | 9790308 | RefSeq | NC_001545.1 | Rubella virus, complete genome |
| 20 | 955 | 309 | 172 | 9626732 | RefSeq | NC_001489.1 | Hepatitis A virus, complete genome |
| 21 | 834 | 103 | 29 | 38371716 | RefSeq | NC_005222.1 | Hantaan virus, complete genome |
| 22 | 837 | 188 | 98 | 38371727 | RefSeq | NC_005217.1 | Sin Nombre virus, complete genome |
| 23 | 430 | 100 | 86 | 23334588 | RefSeq | NC_004294.1 | Lymphocytic choriomeningitis virus segment S, complete sequence |
| 23 | 853 | 455 | 286 | 23334585 | RefSeq | NC_004291.1 | Lymphocytic choriomeningitis virus segment L, complete sequence |
| 24 | 1404 | 204 | 122 | 9626460 | RefSeq | NC_001437.1 | Japanese encephalitis virus, genome |
| 25 | 1370 | 284 | 91 | 51850386 | DNA Database of Japan | AB189128.1 | Dengue virus type 3 genomic RNA, complete genome, strain: 98902890 DF DV-3 |
| 26 | 1361 | 130 | 57 | 12659201 | Genbank | AF326573.1 | Dengue virus type 4 strain 814669, complete genome |
| 27 | 1370 | 142 | 21 | 19744844 | Genbank | AF489932.1 | Dengue Virus Type 2 strain BR64022, complete genome |
| 28 | 1370 | 152 | 52 | 323660 | Genbank | M87512.1 | DENT1SEQ Dengue virus type 1 complete genome |
| 29 | 944 | 175 | 87 | 9626436 | RefSeq | NC_001430.1 | Human enterovirus D, complete genome |
| 30 | 945 | 183 | 122 | 9626433 | RefSeq | NC_001428.1 | Human enterovirus C, complete genome |
| 31 | 946 | 196 | 148 | 9627719 | RefSeq | NC_001612.1 | Human enterovirus A, complete genome |
| 32 | 945 | 364 | 154 | 21363125 | RefSeq | NC_003986.1 | Human echovirus 1, complete genome |
| 33 | 944 | 94 | 12 | 9626677 | RefSeq | NC_001472.1 | Human enterovirus B, complete genome |
| 34 | 913 | 283 | 190 | 9627730 | RefSeq | NC_001617.1 | Human rhinovirus 89, complete genome |
| 35 | 920 | 426 | 291 | 9626735 | RefSeq | NC_001490.1 | Human rhinovirus B, complete genome |

Together with 7 replicates for each viral probe, and control sequences for array synthesis and hybridization (as described below), the array contained a total of 390,482 probes.

Homogeneity, Sensitivity and Specificity

Homogeneity requires the selection of probes which have similar melting temperatures. It was found that probes with low CG-content did not produce reliable hybridization signal intensities, and that probes with high CG-content had a propensity to produce high signal intensities through non-specific binding. Thus, it could be established that the CG-content of probes selected should be from 40% to 60%.

Accordingly, the present invention provides a method of designing oligonucleotide probe(s) for nucleic acid detection, comprising selecting the probes having a CG-content from 40% to 60%.
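The CG-content criterion can be expressed as a simple filter. This is a minimal sketch with hypothetical function names; treating both bounds as inclusive is an assumption consistent with the stated range of 40% to 60%.

```python
def gc_fraction(probe):
    """Fraction of G and C bases in a non-empty probe sequence."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def passes_gc_filter(probe, low=0.40, high=0.60):
    """True if the probe's CG-content lies within [low, high]."""
    return low <= gc_fraction(probe) <= high
```

A list of candidate probes could then be reduced with `[p for p in probes if passes_gc_filter(p)]`.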

The term “hybridization” refers to the process in which the oligo probes bind non-covalently to the target nucleic acid, or portion thereof, to form a stable double-stranded structure. Triple-stranded hybridization is also theoretically possible. Hybridization probes are oligonucleotides capable of binding in a base-specific manner to a complementary strand of target nucleic acid. Hybridizing specifically refers to the binding, duplexing, or hybridizing of a molecule substantially to or only to a particular nucleotide sequence or sequences under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular) of DNA or RNA. Hybridizations, e.g., allele-specific probe hybridizations, are generally performed under stringent conditions, for example, conditions where the salt concentration is no more than about 1 Molar (M) and the temperature is at least 25° C., e.g., 750 mM NaCl, 50 mM NaPhosphate, 5 mM EDTA, pH 7.4 (5 times SSPE) and a temperature of from about 25° C. to about 30° C. For stringent conditions, see also, for example, Sambrook and Russell, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, New York (2001), which is hereby incorporated by reference in its entirety for all purposes above.

Sensitivity requires that probes that cannot form significant secondary structures be selected in order to detect low-abundance mRNAs. Thus, probes with the highest free energy computed based on the Nearest-Neighbor model are selected (SantaLucia, J., Jr., et al., 1996).

Accordingly, the present invention provides a method of designing at least one oligonucleotide probe for nucleic acid(s) detection, wherein the probe(s) are selected by having the highest free energy computed based on the Nearest-Neighbor model.

Specificity requires the selection of probes that are most unique to a viral genome. This is to minimize cross-hybridization of the probes with other non-target nucleic acids (for example, viral genomes). Given probe s_(a) and probe s_(b), substrings of target nucleic acids v_(a) and v_(b) respectively, s_(a) is selected based on the hamming distance between s_(a) and any length-m substring s_(b) from the target nucleic acid v_(b) and/or on the longest common substring of s_(a) and probe s_(b). In particular, let s_(a) and s_(b) be length-m substrings from viral genomes v_(a) and v_(b) respectively, where v_(a)≠v_(b).

The length of the probe(s) to be designed may be any length useful for the purposes of the present invention. The probes may be less than 100 mer, for example 20 to 80 mer, 25 to 60 mer, for example 40 mer. The hamming distance and/or longest common substring may also vary.

According to Kane's criteria (Kane, M. D., et al., 2000), s_(a) is specific to v_(a) if:

-   -   (a) the hamming distance between s_(a) and any length-m substring s_(b) from viral genome v_(b) is more than 0.25 m;
    -   (b) the longest common substring of s_(a) and s_(b) is less than 15.

The cutoff value(s) for the hamming distance may be chosen according to the stringency desired. It will be evident to any skilled person how to select the hamming distance cutoff according to the particular stringency desired. According to a particular example of the herein described probe design, the inventors used hamming distance cutoffs of >10 with respect to other target nucleic acids for specific probes, and <10, preferably <5, for conserved probes. A specific probe is a probe which hybridizes only to a specific target nucleic acid, while a conserved probe is a probe which may hybridize to any member of the family of the target nucleic acid.

Accordingly, the present invention also provides a method of designing oligonucleotide probe(s) for nucleic acid detection, wherein, given probe s_(a) and probe s_(b), substrings of target nucleic acids v_(a) and v_(b) comprised in the biological sample, s_(a) is selected if the hamming distance between s_(a) and any length-m substring s_(b) from the target nucleic acid v_(b) is more than 0.25 m, and the longest common substring of s_(a) and probe s_(b) is less than 15.

To study array hybridization dynamics without the complexity of cross-hybridization from human RNA, SARS coronavirus and Dengue serotype 1 viral RNA were purified from the media of infected cell lines, reverse-transcribed, and PCR-amplified using virus-specific primers (Wong et al., 2004). Each genome cDNA was amplified in its entirety (as confirmed by sequencing), labeled with Cy3 and hybridized separately on microarrays. The SARS sample hybridized well to the SARS tiling probes, with all 3,805 SARS-specific probes displaying fluorescent (Cy3) signal well above the detection threshold (determined by probe signal intensities >2 standard deviations above the mean array signal intensity; FIG. 7A). Cross-hybridization with other pathogen probe sets was minimal, observed only for other members of Coronaviridae and a few species of Picornaviridae and Paramyxoviridae, consistent with the observation that SARS shares little sequence homology with other known viruses (Ksiazek et al. 2003). The hybridization pattern of Dengue 1, on the other hand, was more complex (FIG. 7B). First, we observed that hybridization to the Dengue 1 probe set was incomplete (i.e., some regions showed no signal) due to sequence polymorphisms. The Dengue 1 sample hybridized on the array was cultured from a Hawaiian isolate in 1944 (ATCC Catalog #VR-1254), whereas the array probe set is based on the sequence of strain S275/90, isolated in Singapore in 1990 (Fu et al. 1992). The Dengue 1 probes that failed to hybridize with the cDNA target each contained at least 3 mismatches (within a 15-base stretch) with the target sequence. Second, we observed that cross-hybridization occurred to some degree with almost all viral probe sets present on the array, particularly with probes of other Flaviviridae members, consistent with the fact that the 4 Dengue serotypes share 60-70% homology.
To understand the relationship between hybridization signal output and annealing specificity, we first compared all probe sequences to each viral genome using 2 measures of similarity: probe hamming distance (HD) and maximum contiguous match (MCM). HD measures the overall similarity distance of two sequences, with low scores for similar sequences (Hamming, 1950). MCM measures the number of consecutive bases which are exact matches, with high scores for similar sequences (Kane et al. 2000).

We calculated the HD and MCM scores for every probe relative to the Hawaiian Dengue 1 isolate and observed that these scores are inversely and directly correlated respectively to probe signal intensity. All probes on the array with high similarity to the Hawaiian Dengue 1 genome, i.e. HD≦2 (n=942) or MCM≧27 (n=627), hybridized with median signal intensity 3 logs above background. Although 98% of probes were detectable at the low HD range from 0-4, or high MCM range from 18-40, median probe signal intensity decreased at every increment of sequence distance. Median signal intensity dropped off sharply to background levels at HD=7 and MCM=15, with 43% and 46% detectable probes respectively. The majority of probes (>96%, n>51,000) had HD scores between 8-21 and/or MCM scores between 0-15, of which 1.23% and 1.57% were detectable respectively.

The ideal cross-hybridization similarity threshold would be one where all probes identifying a specific pathogen would always have detectable signal intensity above background noise, even in the presence of polymorphisms in the pathogen sequence. At the optimal similarity thresholds HD≦4 and MCM≧18, >98% of probes could be detected with median signal intensity 2 logs above background, whereas adjusting the threshold down 1 step to HD≦5 and MCM≧17 would result in only ˜85% probe detection and median signal intensity ˜1.2 logs above background (FIG. 8).

Using these optimal HD and MCM thresholds to predict cross-hybridization, we binned all probes into groups most likely to detect a given pathogen. We refer to these groups as specific signature probe sets (SPSs), and we defined SPSs for each of the 35 pathogen genomes represented on the array (Table 2).

TABLE 2 Each pathogen signature probe set (SPS) comprises its probes with AES in the top 20th percentile [column (1)]. Probes that do not have GC between 40-60% [column (2)] or have high similarity to human genome [column (3)] were removed. Probes derived from other pathogens which will cross-hybridize to the pathogen based on HD and MCM [column (4)] were added to the SPS [column (5)].

| No. | Pathogen | Family | Total tiling probes | AES filter (1) | GC content filter (2) | Human genome filter (3) | No. of filtered probes left | No. of predicted cross-hybridizing probes (HD ≦ 4 and MCM ≧ 18) (4) | No. of probes in SPS (5) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LCMV | Arenaviridae | 1283 | 574 | 1 | 18 | 555 | 0 | 555 |
| 2 | Hantaan | Bunyaviridae | 834 | 131 | 6 | 22 | 103 | 2 | 105 |
| 3 | Sin Nombre | Bunyaviridae | 837 | 225 | 8 | 29 | 188 | 3 | 191 |
| 4 | 229E | Coronaviridae | 3495 | 196 | 2 | 12 | 182 | 2 | 184 |
| 5 | OC43 | Coronaviridae | 3937 | 663 | 16 | 43 | 604 | 3 | 607 |
| 6 | SARS | Coronaviridae | 3805 | 672 | 6 | 78 | 588 | 3 | 591 |
| 7 | Dengue serotype 1 | Flaviviridae | 1370 | 201 | 2 | 47 | 152 | 50 | 202 |
| 8 | Dengue serotype 2 | Flaviviridae | 1370 | 178 | 0 | 36 | 142 | 71 | 213 |
| 9 | Dengue serotype 3 | Flaviviridae | 1370 | 336 | 1 | 51 | 284 | 69 | 353 |
| 10 | Dengue serotype 4 | Flaviviridae | 1361 | 172 | 1 | 41 | 130 | 44 | 174 |
| 11 | Japanese encephalitis | Flaviviridae | 1404 | 274 | 6 | 64 | 204 | 40 | 244 |
| 12 | West Nile | Flaviviridae | 1401 | 111 | 4 | 22 | 85 | 22 | 107 |
| 13 | Yellow Fever | Flaviviridae | 1389 | 151 | 0 | 6 | 145 | 10 | 155 |
| 14 | Hepatitis B | Hepadnaviridae | 409 | 146 | 2 | 10 | 134 | 0 | 134 |
| 15 | Influenza A | Orthomyxoviridae | 1582 | 601 | 2 | 46 | 553 | 0 | 553 |
| 16 | Influenza B | Orthomyxoviridae | 1822 | 718 | 7 | 69 | 642 | 2 | 644 |
| 17 | Human papillomavirus type 10 | Papillomaviridae | 1011 | 177 | 1 | 7 | 169 | 0 | 169 |
| 18 | hMPV | Paramyxoviridae | 1705 | 375 | 23 | 60 | 292 | 8 | 300 |
| 19 | Newcastle disease | Paramyxoviridae | 1943 | 252 | 0 | 8 | 244 | 0 | 244 |
| 20 | Nipah | Paramyxoviridae | 2335 | 274 | 22 | 17 | 235 | 0 | 235 |
| 21 | Parainfluenza 1 | Paramyxoviridae | 1995 | 625 | 13 | 62 | 550 | 3 | 553 |
| 22 | Parainfluenza 2 | Paramyxoviridae | 2002 | 838 | 31 | 45 | 762 | 0 | 762 |
| 23 | Parainfluenza 3 | Paramyxoviridae | 1979 | 834 | 29 | 104 | 701 | 9 | 710 |
| 24 | RSV B | Paramyxoviridae | 1948 | 655 | 52 | 66 | 537 | 4 | 541 |
| 25 | Echovirus 1 | Picornaviridae | 945 | 439 | 3 | 72 | 364 | 59 | 423 |
| 26 | Enterovirus A | Picornaviridae | 946 | 205 | 0 | 9 | 196 | 21 | 217 |
| 27 | Enterovirus B | Picornaviridae | 944 | 109 | 0 | 15 | 94 | 47 | 141 |
| 28 | Enterovirus C | Picornaviridae | 945 | 202 | 0 | 19 | 183 | 31 | 214 |
| 29 | Enterovirus D | Picornaviridae | 944 | 191 | 0 | 16 | 175 | 15 | 190 |
| 30 | Foot and mouth disease | Picornaviridae | 1036 | 356 | 26 | 5 | 325 | 0 | 325 |
| 31 | Hepatitis A | Picornaviridae | 955 | 355 | 9 | 37 | 309 | 0 | 309 |
| 32 | Rhinovirus A (type 89) | Picornaviridae | 913 | 333 | 2 | 48 | 283 | 13 | 296 |
| 33 | Rhinovirus B | Picornaviridae | 920 | 464 | 3 | 35 | 426 | 11 | 437 |
| 34 | HIV 1 | Retroviridae | 1174 | 229 | 4 | 17 | 208 | 0 | 208 |
| 35 | Rubella | Togaviridae | 1246 | 748 | 534 | 3 | 211 | 0 | 211 |
| Total | | | 53555 | | | | 10955 | | 11497 |

Each pathogen's SPS comprised tiling probes derived from its genome sequence (HD=0, MCM=40), as well as cross-hybridizing probes derived from other pathogens (HD≦4, MCM≧18).
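The HD and MCM thresholds quoted above can be illustrated with a minimal Python sketch. It assumes that HD is the Hamming distance between a probe and an equal-length window of another genome, and that MCM is the longest run of consecutive matching positions in that same alignment; the function names and the position-wise reading of MCM are illustrative assumptions, not definitions taken from the text.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def max_contiguous_match(a: str, b: str) -> int:
    """Longest run of consecutive aligned positions at which a and b agree
    (assumed reading of MCM)."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def predicts_cross_hybridization(probe: str, window: str,
                                 max_hd: int = 4, min_mcm: int = 18) -> bool:
    """Apply the HD <= 4 and MCM >= 18 thresholds to one probe/window pair."""
    return (hamming_distance(probe, window) <= max_hd
            and max_contiguous_match(probe, window) >= min_mcm)
```

Under these assumptions a probe compared with its own genome position gives HD=0 and MCM=40 for a 40-mer, matching the values quoted for tiling probes.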

We next considered other non-specific hybridization phenomena that could affect performance of our SPS probes. For example, we observed a general relationship between probe signal and % GC content. Consistent with previous observations, we found that probes with <40% GC content resulted in diminished signal intensities, while probes with >60% GC content showed higher signal intensities (Wong et al. 2004; Maskos and Southern, 1993). Thus, we utilized % GC content as an additional selection filter, whereby probes with <40% GC or >60% GC were excluded from our SPSs, despite optimal HD and MCM values.
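The GC-content filter can be sketched in a few lines of Python. The function names are hypothetical, and the bounds are treated as inclusive because the text excludes only probes below 40% or above 60% GC.

```python
def gc_fraction(probe: str) -> float:
    """Fraction of G and C bases in a probe sequence."""
    if not probe:
        raise ValueError("probe must not be empty")
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def passes_gc_filter(probe: str, low: float = 0.40, high: float = 0.60) -> bool:
    """Keep probes whose GC fraction lies within [low, high]."""
    return low <= gc_fraction(probe) <= high
```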

Sequence Similarity to Human Genome

In case the target nucleic acid to be detected is extracted from humans (for example, human samples containing viral genomes), probes with high homology to the human genome should also be avoided. Accordingly, for any probe s_(a) of length-m specific for the target nucleic acid v_(a), the probe s_(a) is selected if it does not have any hits with any region of a nucleic acid different from the target nucleic acid, and if the probe s_(a) of length-m has hits with the nucleic acid different from the target nucleic acid, the probe s_(a) of length-m with the smallest maximum alignment length and/or with the least number of hits is selected. In particular, for any length-m probe s_(a), hits of s_(a) with the human genome are found with the BLAST algorithm (Altschul, S. F., et al., 1997). A BLAST word size of 15 (W=15) and an expectation value of 100 were used to find all hits. s_(a) is selected if it does not have any hits with the human genome, that is, it is specific to v_(a). However, if all length-m substrings of v_(a) have hits with the human genome, those with the smallest maximum alignment length and with the least number of hits were selected.

Furthermore, as cross-hybridization with human sequences could also confound results, we compared all probes to the human genome assembly (build 17) (International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409(6822), 860-921 (2001)) by BLAST using a word size of 15 (Altschul et al. 1997). Probes with expectation value of 100 were further filtered from the SPSs (see Table 2 above).

Accordingly, the present invention provides a method of designing oligonucleotide probe(s) for nucleic acid detection, wherein for any probe s_(a) of length-m specific for the target nucleic acid v_(a), the probe s_(a) is selected if it does not have any hits with any region of a nucleic acid different from the target nucleic acid, and if the probe s_(a) of length-m has hits with the nucleic acid different from the target nucleic acid, the probe s_(a) of length-m with the smallest maximum alignment length and/or with the least number of hits is selected.

Further, the design of the oligonucleotide probe(s) may also be carried out by AES according to the invention. In particular, the invention provides a method of selecting and/or designing probes wherein a probe p_(i) at position i of a target nucleic acid is selected if p_(i) is predicted to hybridize to the position i of the amplified target nucleic acid.

In particular, the oligonucleotide probe(s) capable of hybridizing to the selected region(s) may be selected and/or designed according to at least one of the following criteria:

-   (a) the selected probe(s) has a CG-content from 40% to 60%;
-   (b) the probe(s) is selected by having the highest free energy computed based on the Nearest-Neighbor model;
-   (c) given probe s_(a) and probe s_(b) substrings of target nucleic acids v_(a) and v_(b), s_(a) is selected based on the hamming distance between s_(a) and any length-m substring s_(b) from the target nucleic acid v_(b) and/or on the longest common substring of s_(a) and probe s_(b);
-   (d) for any probe s_(a) of length-m specific for the target nucleic acid v_(a), the probe s_(a) is selected if it does not have any hits with any region of a nucleic acid different from the target nucleic acid, and if the probe s_(a) of length-m has hits with the nucleic acid different from the target nucleic acid, the probe s_(a) of length-m with the smallest maximum alignment length and/or with the least number of hits is selected; and/or
-   (e) a probe p_(i) at position i of a target nucleic acid is selected if p_(i) is predicted to hybridize to the position i of the amplified target nucleic acid.

According to a particular aspect of the invention, two or more of the criteria indicated above may be used for designing the oligonucleotide probe(s). For example, the probe(s) may be designed by applying all criteria (a) to (e). Other criteria, not explicitly mentioned herein but which are evident to a skilled person in the art, may also be used.

In particular, under the criterion (e), a probe p_(i) at position i of a target nucleic acid v_(a) is selected if P(p_(i)|v_(a))>λ, wherein λ is 0.5 and P(p_(i)|v_(a)) is the probability that p_(i) hybridizes to the position i of the target nucleic acid v_(a). More in particular, λ is 0.8.

According to another aspect, the invention provides a method as described above wherein

${{{P\left( p_{i} \middle| v_{a} \right)} \approx {P\left( {X \leqq x_{i}} \right)}} = \frac{c_{i}}{k}},$

wherein X is the random variable representing the amplification efficiency score (AES) values of all probes of v_(a), k is the number of probes in v_(a), and c_(i) is the number of probes whose AES values are ≦x_(i).

According to another aspect, the AES can also be used to design random primer tags to facilitate random amplification of a sample by random PCR (for use in applications such as detection of pathogens, detection of gene expression, constructing clonal DNA libraries, and other applications in which a skilled person would employ random PCR).

Synthesis of Oligonucleotide Probes on a Support

According to another aspect of the invention, the method of selecting and/or designing at least one oligonucleotide probe as described above further comprises a step of preparing the selected and/or designed probe(s). Designing a probe comprises understanding its sequence and/or designing it by any suitable means, for example by using software. The step of preparing the probe comprises its physical preparation. The probe may be prepared according to any standard method known in the art. For example, the probes may be chemically synthesized or prepared by cloning, for example as described in Sambrook and Russel, 2001.

There is also provided a support, for example a microarray or biochip, prepared according to any embodiment according to the present invention.

The probe(s) designed and prepared according to any method of the present invention may be used in solution or may be placed on an insoluble support. For example, they may be applied, spotted or printed on an insoluble support according to any technique known in the art. The support may be a solid support or a gel. The support with the probes applied on it may be a microarray or a biochip.

More in particular, the present invention provides an oligo microarray hybridization-based approach for the rapid detection and identification of pathogens, for example viral and/or bacterial pathogens, from PCR-amplified cDNA prepared from primary tissue samples, in particular from random PCR-amplified cDNA(s).

In the following description, the preparation of probes is made with particular reference to a microarray. However, the support, as well as the probes, may be prepared according to any description across the whole content of the present application. In particular, an “array” is an intentionally created collection of molecules which may be prepared either synthetically or biosynthetically. The molecules in the array may be identical or different from each other. The array may assume a variety of formats, e.g., libraries of soluble molecules; libraries of compounds tethered to resin beads, silica chips, or other solid supports. An Array Plate or a Plate is a body having a plurality of arrays in which each array is separated from the other arrays by a physical barrier resistant to the passage of liquids and forming an area or space, referred to as a well.

Sample Preparation and Hybridization onto the Microarray

The biological sample may be any sample taken from a mammal, for example from a human being. The biological sample may be blood, a body fluid, saliva, urine, stool, and the like. The biological sample may be treated to free the nucleic acid comprised in the biological sample before carrying out the amplification step. The target nucleic acid may be any nucleic acid which is intended to be detected. The target nucleic acid to be detected may be at least a nucleic acid exogenous to the nucleic acid of the biological sample. Accordingly, if the biological sample is from a human, the exogenous target nucleic acid to be detected (if present in the biological sample) is a nucleic acid which is not of human origin. According to an aspect of the invention, the target nucleic acid to be detected is at least a pathogen genome or fragment thereof. The pathogen nucleic acid may be at least a nucleic acid from a virus, a parasite, or a bacterium, or a fragment thereof.

According to an aspect of the present invention, there is provided a method of target nucleic acid detection analysis. The target nucleic acid(s) from a biological sample desired to be detected may be any target nucleic acid, RNA and/or DNA, for example mRNA and/or cDNA. More in particular, the target nucleic acid to be detected may be from a pathogen or non-pathogen. For example, it may be the genome or a fragment thereof of at least one virus, at least one bacterium and/or at least one parasite. The probes selected and/or prepared may be placed, applied and/or fixed on a support according to any standard technology known to a skilled person in the art. The support may be an insoluble support, for example a solid support, in particular a microarray and/or a biochip.

According to a particular example, RNA and DNA were extracted from patient samples (e.g. tissues, sera, nasal pharyngeal washes, stool) using established protocols and commercial kits. For example, a Qiagen kit for nucleic acid extraction may be used. Alternatively, phenol/chloroform may also be used for the extraction of DNA and/or RNA. Any technique known in the art, for example as described in Sambrook and Russel, 2001, may be used. RNA was reverse-transcribed to cDNA using tagged random primers, based on a protocol described by Bohlander et al., 1992 and Wang et al., 2003. The cDNA was then amplified by random PCR. Fragmentation, labeling and hybridization of sample to the microarray were carried out as described by Wong et al., 2004.

Microarray Synthesis

According to a particular experiment described in the Examples section, the present inventors selected several viral genomes representing the most common causes of viral disease in Singapore. Using the complete genome sequences downloaded from Genbank, 40-mer probes which tiled across the entire genomes and overlapped at five-base resolution were generated. Seven replicates of each virus probe were synthesized directly onto the microarray using Nimblegen technology (Nuwaysir, E. F., et al., 2002). The probes were randomly distributed on the microarray to minimize the effects of hybridization artifacts. To control for non-specific hybridization of sample to probes, 10,000 oligonucleotide probes were designed and synthesized onto the microarray. These 10,000 oligonucleotides did not have any sequence similarity to the human genome, or to the pathogen genomes. They were random probes with 40-60% CG-content. These probes measured the background signal intensity. As a positive control, 400 oligonucleotide probes to human genes which have known or inferred functions in immune response were synthesized on the array. A plant virus, PMMV, was included as a negative control, for a total of approximately 380,000 probes. In the following description, the invention will be described in more particularity with reference to a pathogen detection chip analysis (also referred to as PDC). However, the analysis (method) is not limited to this particular embodiment, but encompasses the several aspects of the invention as described across the whole content of the present application.

Method of Detecting Target Nucleic Acid(s)

According to another aspect, the present invention provides a method of detecting at least one target nucleic acid comprising the steps of:

-   (i) providing a biological sample;
-   (ii) amplifying nucleic acid(s) comprised in the biological sample;
-   (iii) providing at least one oligonucleotide probe capable of hybridizing to at least one target nucleic acid, if present in the biological sample, wherein the probe(s) is prepared by using a method according to any aspect of the invention herein described;
-   (iv) contacting the probe(s) with the amplified nucleic acids and detecting the probe(s) hybridized to at least one target nucleic acid.

The amplification step (ii) may be carried out in the presence of random, partially random (that is, comprising a fixed portion and a random portion) or specific primers. In particular, the amplification step (ii) may be carried out in the presence of at least one random primer, more in particular in the presence of at least one random forward primer and/or at least one random reverse primer. For example, the amplification step (ii) may be carried out in the presence of more than two random primers. Any amplification method known in the art may be used. For example, the amplification method is RT-PCR.

In particular, the present inventors developed a method of detecting the probe(s) hybridized to the target nucleic acid based on the amplification efficiency score (AES). This may herein also be referred to as the algorithm according to the present invention. In particular, a forward random primer binding to position i and a reverse random primer binding to position j of a target nucleic acid v_(a) are selected among primers having an amplification efficiency score (AES_(i)) for every position i of a target nucleic acid v_(a) of:

$AES_{i} = \sum_{j = i - Z}^{i} \left\{ P^{f}(j) \cdot \sum_{k = \max(i + 1,\, j + 500)}^{j + Z} P^{r}(k) \right\},\quad \text{wherein}\quad \sum_{k = \max(i + 1,\, j + 500)}^{j + Z} P^{r}(k) = P^{r}(i + 1) + P^{r}(i + 2) + \ldots + P^{r}(j + Z)$

P^(f)(i) and P^(r)(i) are the probabilities that a random primer r_(i) can bind to position i of v_(a) as forward primer and reverse primer, respectively, and Z≦10000 bp is the length of the region of v_(a) desired to be amplified. More in particular, Z may be ≦5000 bp, ≦1000 bp, or ≦500 bp.
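The double sum defining AES_(i) can be evaluated with a prefix sum over the reverse-primer probabilities. The following Python sketch is illustrative only: it assumes 0-based positions, assumes that indices falling outside the genome are simply clipped, and uses the hypothetical parameter name `min_span` for the constant 500 in the formula. How P^(f) and P^(r) are obtained is not covered here.

```python
from itertools import accumulate


def amplification_efficiency_scores(p_forward, p_reverse, z, min_span=500):
    """Return AES_i for every position i of a target sequence.

    p_forward[j] and p_reverse[k] are the binding probabilities P^f(j) and
    P^r(k); z is the maximum amplified length Z. Out-of-range indices are
    clipped to the sequence (an assumption of this sketch).
    """
    n = len(p_forward)
    if len(p_reverse) != n:
        raise ValueError("p_forward and p_reverse must have equal length")
    # prefix[t] holds the sum of p_reverse[0:t], so any inner sum is O(1).
    prefix = [0.0] + list(accumulate(p_reverse))
    scores = []
    for i in range(n):
        total = 0.0
        for j in range(max(0, i - z), i + 1):
            k_lo = max(i + 1, j + min_span)
            k_hi = min(j + z, n - 1)
            if k_lo <= k_hi:
                total += p_forward[j] * (prefix[k_hi + 1] - prefix[k_lo])
        scores.append(total)
    return scores
```

For example, with uniform probabilities of 1.0 on a six-base toy sequence, `amplification_efficiency_scores([1.0] * 6, [1.0] * 6, z=3, min_span=2)` returns `[2.0, 4.0, 5.0, 4.0, 2.0, 0.0]`: positions in the interior can be covered by more forward/reverse primer pairs than positions near the ends.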

The amplification step may comprise forward and reverse primers, and each of the forward and reverse primers may comprise, in a 5′-3′ orientation, a fixed primer header and a variable primer tail, and wherein at least the variable tail hybridizes to a portion of the target nucleic acid v_(a). In particular, the amplification step may comprise forward and/or reverse random primers having the nucleotide sequence of any of SEQ ID NOS:1-7, or a variant, or derivative thereof.

The biological sample may be any sample taken from a mammal, for example from a human being. The biological sample may be tissue, sera, nasal pharyngeal washes, saliva, any other body fluid, blood, urine, stool, and the like. The biological sample may be treated to free the nucleic acid comprised in the biological sample before carrying out the amplification step. The target nucleic acid may be any nucleic acid which is intended to be detected. The target nucleic acid to be detected may be at least a nucleic acid exogenous to the nucleic acid of the biological sample. Accordingly, if the biological sample is from a human, the exogenous target nucleic acid to be detected (if present in the biological sample) is a nucleic acid which is not of human origin.

According to an aspect of the invention, the target nucleic acid to be detected is at least a pathogen genome or fragment thereof. The pathogen nucleic acid may be at least a nucleic acid from a virus, a parasite, or a bacterium, or a fragment thereof.

Accordingly, the invention provides a method of detection of at least one target nucleic acid, if present, in a biological sample. The method may be a diagnostic method for the detection of the presence of a pathogen in the biological sample. For example, if the biological sample is obtained from a human being, the target nucleic acid, if present in the biological sample, is not of human origin.

The probe(s) designed and/or prepared according to any method of the present invention may be used in solution or may be placed on an insoluble support. For example, they may be applied, spotted or printed on an insoluble support according to any technique known in the art. The support with the probes applied on it may be a solid support or a gel. In particular, it may be a microarray or a biochip.

The probes are then contacted with the nucleic acid of the biological sample, and if present the target nucleic acid(s) and the probe(s) hybridize, and the presence of the target nucleic acid is detected. In particular, in the detection step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.

More in particular, in the detection step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.

For example, in the detection step (iv), the presence of a target nucleic acid in a biological sample is given by a t-test value ≦0.1 and/or an Anderson-Darling test value ≦0.05 and/or a Weighted Kullback-Leibler divergence value of ≧1.0, preferably ≧5.0. In particular, the t-test value is ≦0.05.

According to another aspect, the present invention provides a method of determining the presence of a target nucleic acid v_(a) comprising detecting the hybridization of a probe to a target nucleic acid v_(a) and wherein the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), thereby indicating the presence of v_(a). In particular, the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a). More in particular, the presence of a target nucleic acid in a biological sample is given by a t-test value ≦0.1 and/or an Anderson-Darling test value ≦0.05 and/or a Weighted Kullback-Leibler divergence value of ≧1.0, preferably ≧5.0. For example, the t-test value may be ≦0.05.

According to another aspect, the present invention provides a method of detecting at least one target nucleic acid, comprising the steps of:

-   (i) providing at least one biological sample;
-   (ii) amplifying nucleic acid(s) comprised in the biological sample;
-   (iii) providing at least one oligonucleotide probe capable of hybridizing to at least one target nucleic acid, if present in the biological sample;
-   (iv) contacting the probe(s) with the amplified nucleic acids and detecting the probe(s) hybridized to target nucleic acid(s), wherein the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.

In step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample. In particular, in step (iv) the presence of a target nucleic acid in a biological sample is given by a t-test value ≦0.1 and/or an Anderson-Darling test value ≦0.05 and/or a Weighted Kullback-Leibler divergence value of ≧1.0, preferably ≧5.0. The t-test value may be ≦0.05. The nucleic acid to be detected is nucleic acid exogenous to the nucleic acid of the biological sample. The target nucleic acid to be detected may be at least a pathogen genome or fragment thereof. The pathogen nucleic acid may be at least a nucleic acid from a virus, a parasite, or a bacterium, or a fragment thereof. In particular, when the sample is obtained from a human being, the target nucleic acid, if present in the biological sample, is not from the human genome. The probes may be placed on an insoluble support. The support may be a microarray or a biochip.

Test Using the Template Sequence of RSV B

To verify whether the variation in signal intensities displayed by different regions of a virus correlates directly with their corresponding amplification efficiency scores, a total of five microarray experiments were performed on a common pathogen affecting humans, the human respiratory syncytial virus B (RSV B).

Next, the probe design criteria, as described above, were applied on the template sequence of RSV B obtained from NCBI (NC_001781). This resulted in 1948 probes spotted onto each microarray. The amplification efficiency map for RSV B was also computed prior to the actual experiments and is shown in FIG. 2. This figure shows the peaks having an AES higher than the average AES, indicating the regions of RSV B with higher probability of amplification.

Using five samples containing the human respiratory syncytial virus B (RSV B), independent microarray experiments were conducted. The resultant signal intensities for one such experiment are shown in FIG. 3.

For each experiment, the signal intensities of the 1948 probes were ranked in decreasing order and were correlated with their corresponding AES value. The p-value was found to be <2.2e-16 on average. This indicates that the correlation between the signal intensity of the probe at position i of RSV B and AES_(i) is not at all random. Further investigations revealed that about 300 probes, which consistently produced high signal intensities in all five experiments, have amplification efficiency scores in the 90^(th) percentile level.

Having shown that the described amplification efficiency model works well on the RSV B genome, it was desired to show that the model according to the invention may be extended to other viral genomes as well. Another microarray experiment was performed on the human metapneumovirus (HMPV). This time, there were 1705 probes on the microarray. Again, the amplification efficiency map for HMPV was computed. In this experiment, the correlation test between signal intensities and amplification efficiency scores gave a p-value of 1.335e-9.

Accordingly, the amplification efficiency model according to the invention is able to predict the relative strength of signals produced by different regions of a viral genome in the described experimental set-up. Probes from regions with low amplification efficiency scores have a high tendency to produce no or low signal intensities. This would result in a false negative on the microarray. Such probes will complicate the analysis of the microarray data, and this is made even more complicated since a low signal intensity of a probe may be due either to its target genome not being present or simply to its target region not being amplified. As such, probes in regions with reasonably high amplification efficiency scores should be selected to minimize inaccuracies caused by the RT-PCR process using random primers.

The threshold for amplification efficiency scores for probe selection for a virus v_(a) is determined by the cumulative distribution function of the AES values of v_(a). Let X be the random variable representing the AES values of all probes of v_(a). Let k be the number of probes in v_(a). Then, we denote the probability that the AES value is less than or equal to x by P(X≦x)=c/k, where c is the number of probes which have AES values less than or equal to x. For a probe p_(i) at position i of v_(a), let x_(i) be its corresponding AES value. Since the signal intensity of a probe is highly correlated to its AES value, we estimate P(p_(i)|v_(a)), the probability that p_(i) has high signal intensity in the presence of v_(a), to be P(X≦x_(i)). Thus,

${{P\left( p_{i} \middle| v_{a} \right)} \approx {P\left( {X \leq x_{i}} \right)}} = \frac{c_{i}}{k}$

where c_(i) is the number of probes whose AES values are less than or equal to x_(i).

For probe selection, probe p_(i) is selected if P(p_(i)|v_(a))>λ. In the present experiments, λ was set as λ=0.8.
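The selection rule amounts to ranking each probe's AES against the empirical cumulative distribution of all AES values of the same virus. A minimal Python sketch follows; the function names are hypothetical and the input is assumed to be one AES value per probe of v_(a).

```python
from bisect import bisect_right


def selection_probabilities(aes_values):
    """Estimate P(p_i | v_a) as c_i / k, where c_i is the number of probes
    whose AES is less than or equal to the AES of probe i."""
    k = len(aes_values)
    ordered = sorted(aes_values)
    return [bisect_right(ordered, x) / k for x in aes_values]


def select_probes(aes_values, lam=0.8):
    """Return the indices of probes with P(p_i | v_a) > lam."""
    probabilities = selection_probabilities(aes_values)
    return [i for i, p in enumerate(probabilities) if p > lam]
```

With λ=0.8 and distinct AES values, this keeps roughly the top fifth of the probes of a virus, which agrees with the top 20^(th) percentile AES filter reported in column (1) of Table 2.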

Accordingly, the present invention also provides a method of probe design and/or of target nucleic acid detection wherein a probe p_(i) at position i of a target nucleic acid v_(a) is selected if P(p_(i)|v_(a))>λ, wherein λ is 0.75 and P(p_(i)|v_(a)) is the probability that p_(i) has a high signal intensity in the presence of v_(a). More in particular,

${{{P\left( p_{i} \middle| v_{a} \right)} \approx {P\left( {X \leq x_{i}} \right)}} = \frac{c_{i}}{k}},$

wherein X is the random variable representing the amplification efficiency score (AES) values of all probes of v_(a), k is the number of probes in v_(a), and c_(i) is the number of probes whose AES values are less than or equal to x_(i).

Target Nucleic Acid Detection Analysis

In the following description, the invention will be described in more particularity with reference to a pathogen detection chip analysis (also referred to as PDC). However, the analysis (method) is not limited to this particular embodiment, but encompasses the several aspects of the invention as described across the whole content of the present application. Therefore, in particular, given a PDC with a set of length-m probes P={p₁, p₂, . . . , p_(i)}, which is designed for a set of viral genomes V={v₁, v₂, . . . , v_(n)}, the pathogen detection chip analysis problem is to detect the virus present in the sample based on the chip data. The chip data here refers to the collective information provided by the probe signals on the PDC. Thus, the chip data D={d₁, d₂, . . . , d_(x)} is the set of corresponding signals of the probe set P on the PDC.

Given a sample, it is not known which pathogens are present in the sample or how many different pathogens there are, if any are present at all. However, if a virus v_(a) is indeed in the sample, then the signal intensities of the probes of v_(a) should differ significantly from the signal intensities of probes from other viruses. Specifically, a higher proportion of probes of v_(a) should have high signal intensities compared to other viruses. Hence, it would be expected that the mean of the signal intensities of the probes in v_(a) should be statistically higher than that of probes ∉v_(a).

Accordingly, the invention provides a method wherein the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), which may indicate the presence of v_(a) in the biological sample.

However, having a statistically higher mean may still be insufficient to conclude that v_(a) is in the sample. Preferably, an additional step may be required. We need to compute the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of probes on the PDC having high signal intensities. This is based on the observation that the distribution of the signal intensities of probes ∈v_(a) is more positively skewed than that of probes ∉v_(a) (see the arrow in FIG. 4A; for comparison see FIG. 4B).

Based on the above observations, the chip data D for the presence of viruses was analyzed as follows. For every virus v_(a)∈V, we used a one-tail t-test (Goulden, C. H., 1956) to determine if the mean of the signal intensities of the probes ∈v_(a) was statistically higher than that of the signal intensities of the probes ∉v_(a). Thus, the t-statistic was computed:

$t_{a} = \frac{\mu_{a} - \mu_{a'}}{\sqrt{\frac{\sigma_{a}^{2}}{n_{a}} + \frac{\sigma_{a'}^{2}}{n_{a'}}}}$

where μ_(a), σ_(a)², and n_(a) are the mean, variance, and size of the signal intensities of the probes ∈v_(a), respectively, and μ_(a′), σ_(a′)², and n_(a′) are the mean, variance, and size of the signal intensities of the probes ∉v_(a), respectively.

To test the significance of the difference, the level of significance was set to 0.05. This means that the hypothesis that the mean of the signal intensities of the probes ∈v_(a) is higher than that of the signal intensities of the probes ∉v_(a) would only be accepted if the p-value of t_(a)<0.05. In this case, v_(a) is likely to be present in the sample.
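The t-statistic above is the unequal-variance (Welch) form and can be computed with the Python standard library alone. The sketch below is illustrative: it assumes that the variances are sample variances (the text does not say whether sample or population variance is meant), and the function name is hypothetical. Converting the statistic into a one-tailed p-value additionally requires the Student t distribution from a statistics package, which is not shown.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(signals_in_virus, signals_not_in_virus):
    """t-statistic comparing probes belonging to v_a with all other probes."""
    n_a = len(signals_in_virus)
    n_b = len(signals_not_in_virus)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two signal intensities")
    # Standard error of the difference of means with unequal variances.
    standard_error = sqrt(variance(signals_in_virus) / n_a
                          + variance(signals_not_in_virus) / n_b)
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    return (mean(signals_in_virus) - mean(signals_not_in_virus)) / standard_error
```

A large positive value indicates that the probes of v_(a) are, on average, brighter than the remaining probes; a negative value indicates the opposite.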

The t-test alone, which allows the inventors to know if the distribution of the signal intensities of a virus is different from that of other viruses, may not be sufficient to determine if a particular virus is in the sample. It is also essential to know how similar or different the two distributions are. A ruler that can be used to measure the similarity between a true distribution and a model distribution is the Kullback-Leibler divergence (Kullback and Leibler, 1951) (also known as the relative entropy). In this application, the probability distribution of the signal intensities of the probes in v_(a) is the true distribution while the probability distribution of the signal intensities of all the probes in P is the model distribution. Let P_(a) be the set of probes in v_(a). The Kullback-Leibler (KL) divergence of the probability distribution of the signal intensities of P_(a) and P is:

${{KL}\left( P_{a}||P \right)} = {\sum\limits_{x \leq {\max {(D)}}}{{f_{a}(x)}{\log \left( \frac{f_{a}(x)}{f(x)} \right)}}}$

where μ is the mean signal intensity of the probes in P; f_(a)(x) is the fraction of probes in P_(a) with signal intensity x; and f(x) is the fraction of probes in P with signal intensity x. It follows that if KL(P_(a)∥P)=0 then the probability distribution of P_(a) is exactly the same as that of P. Otherwise they are different.
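A minimal Python sketch of this divergence is given below. It assumes that the signal intensities have already been discretized (for example, rounded into bins) so that the fractions f_(a)(x) and f(x) are well defined, that P_(a) is a subset of P, and that the logarithm is natural; none of these choices is specified in the text, and the function name is hypothetical.

```python
from collections import Counter
from math import log


def kl_divergence(signals_a, signals_all):
    """KL(P_a || P) over discretized signal intensities.

    signals_a holds the intensities of the probes of v_a; signals_all holds
    the intensities of every probe on the chip, including those of v_a.
    """
    counts_a = Counter(signals_a)
    counts_all = Counter(signals_all)
    n_a, n_all = len(signals_a), len(signals_all)
    total = 0.0
    for x, count in counts_a.items():
        if counts_all[x] == 0:
            raise ValueError("signals_a must be drawn from signals_all")
        f_a = count / n_a
        f = counts_all[x] / n_all
        # Intensities absent from P_a have f_a(x) = 0 and contribute nothing.
        total += f_a * log(f_a / f)
    return total
```

For instance, if half of all probes have intensity 2 and every probe of v_(a) has intensity 2, the divergence equals log 2; if P_(a) has the same distribution as P, it equals 0.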

Since a virus that is present in the sample would have signal intensities higher than that of the population, this implies that v_(a) has a chance of being present in the sample if KL(P_(a)∥P)>0. Thus, the larger the value of KL(P_(a)∥P), the more different are the two probability distributions and the more likely that v_(a) is indeed present in the sample.

It is important to note that the Kullback-Leibler divergence is the collective difference over all x of two probability distributions. Thus, while the Kullback-Leibler divergence is good at finding shifts in a probability distribution, it is not always so good at finding spreads, which affect the tails of the probability distribution more. As described in FIG. 4(A,B), the tails of the probability distribution provide the most information about whether a virus is present in the sample. Hence, the Kullback-Leibler divergence statistic must be improved to reflect such an observation more accurately.

To increase its sensitivity out on the tails, we introduced a stabilized or weighted statistic to the Kullback-Leibler divergence, the Anderson-Darling statistic (Stephens, M. A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons, Journal of the American Statistical Association, Vol. 69, pp. 730-737). Thus the Weighted Kullback-Leibler divergence (WKL) is:

${{WKL}\left( P_{a}||P \right)} = {\sum\limits_{x \leq {\max {(D)}}}\frac{{f_{a}(x)}\log \frac{f_{a}(x)}{f(x)}}{\sqrt{{Q(x)}\left\lbrack {1 - {Q(x)}} \right\rbrack}}}$

where Q(x) is the cumulative distribution function of the signal intensities of the probes in P.

Empirical tests show that in samples where there are no viruses, viruses that pass the t-test with significance level 0.05 have WKL<5.0. In samples where there is indeed a virus present, the actual viruses not only pass the t-test with significance level 0.05 but are also the only viruses to have WKL≧5.0. Thus we set the Weighted Kullback-Leibler divergence threshold for a virus to be present in the sample to be 5.0. This analysis framework is shown in FIG. 5.

Apparatus and/or Product Performing the Method According to the Invention

It is well-known to a person skilled in the art how to configure software which can perform the algorithms and/or methods provided in the present invention. Accordingly, the present invention also provides a software and/or a computer program product configured to perform the algorithms and/or methods according to any embodiment of the present invention. There is also provided at least one electronic storage medium. The electronic storage medium may be a computer hard-drive, a CD-ROM, a flash memory device (e.g. USB thumbdrive), a floppy disk, or any other electronic storage medium in the art. The software may be run on personal computers, mainframes, and any computing processing unit, and the particular configurations are known to a person skilled in the art.

It will be appreciated that the present invention has been described by way of example only and that various modifications in design may be made without departure from the spirit and scope of the invention.

Having now generally described the invention, the same will be more readily understood through reference to the following examples, which are provided by way of illustration, and are not intended to be limiting of the present invention.

EXAMPLES

Standard molecular biology techniques known in the art and not specifically described were generally followed as described in Sambrook and Russell, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, New York (2001).

Microarray Synthesis

We selected 35 viral genomes representing the most common causes of viral disease in Singapore (see Table 1 above).

Complete genome sequences were downloaded from NCBI Taxonomy Database (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/) to generate 40-mer probe sequences tiled across the entire genomes and overlapping at an average 8-base resolution. 7 replicates of each virus probe were synthesized directly onto the microarray using Nimblegen proprietary technology (Nuwaysir et al. 2002). The probes were randomly distributed on the microarray to minimize the effects of hybridization artifacts. To control for non-specific hybridization of sample to probes and measure background signal, 10,000 oligonucleotide probes were designed and synthesized onto the microarray. They are random probes with 40-60% GC-content with no sequence similarity to the human genome, or to the pathogen genomes. As a positive control, 400 oligonucleotide probes to human genes which have known or inferred functions in immune response were synthesized on the array. A plant virus, PMMV, was included as negative control, for a total of 390,482 probes.
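The tiling step described above can be sketched as follows. This is a simplified illustration only: it assumes a fixed 8-base step, whereas the text states an average 8-base resolution, and the function name `tile_probes` is hypothetical.

```python
def tile_probes(genome, probe_len=40, step=8):
    """Return (start, probe) pairs for probes tiled across a genome sequence.

    A fixed step is an assumption made for this sketch; the actual design
    is described only as an average 8-base resolution.
    """
    last_start = len(genome) - probe_len
    # range() is empty when the genome is shorter than one probe.
    return [(i, genome[i:i + probe_len]) for i in range(0, last_start + 1, step)]
```

For a 100-base sequence this yields probes starting at positions 0, 8, 16 and so on up to 56, each 40 bases long.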

Sample Preparation, Microarray Hybridization and Staining

Dengue cell line (ATCC #VR-1254) was cultured as per ATCC recommendations and Sin850 SARS cell line was cultured as described (Vega et al. 2004). Clinical specimens (nasopharyngeal washes) were obtained from an Indonesian pediatric population and stored at −80° C. in RNAzol (Leedo Medical Laboratories, Inc., Friendswood, Tex.). All were suspected pneumonia patients aged between 7 and 38 months demonstrating specific clinical signs of respiratory illnesses. RNA was extracted with RNAzol according to manufacturer's instructions (Smalling et al. 2002; Tang et al. 1999). Extracted RNA was resuspended in RNA storage solution (Ambion, USA) and stored at −80° C. until needed. RNA was reverse transcribed to cDNA using tagged random primers, based on a protocol described by Bohlander et al. and Wang et al. (Wang et al. 2002; Bohlander et al. 1992). The cDNA was then amplified by random PCR, fragmented, end-labeled with biotin, hybridized onto the microarray and stained as previously described (Wong et al. 2002). In our initial experiments, we found that probe GC content could create artifacts in signal intensity measurements, with increasing signal directly proportional to probe GC content. Adding 0.82 M TMAC to Nimblegen's proprietary TMAC hybridization buffer eliminated this artifact.

Real-Time Diagnostic RT-PCR for RSV and hMPV

The 20 μl reaction mixture contained 2 μl of the purified patient RNA, 5 U of MuLV reverse transcriptase, 8 U of recombinant RNase inhibitor, 10 μl of 2× universal PCR Master Mix with no UNG (all from Applied Biosystems), 0.9 μM primer and 0.2 μM probe. The real-time RT-PCRs were carried out in an ABI Prism 7900HT Sequence Detection System (Applied Biosystems). RT was performed at 48° C. for 30 min followed by 10 min at 95° C. for activation of DNA polymerase. Amplification of RT products was achieved by 40 cycles of 15 s at 95° C. and 1 min at 60° C. Negative controls and serial dilutions of a plasmid clone (positive control) were included in every PCR assay. During amplification, fluorescence emissions were monitored at every thermal cycle. The threshold (CT) represents the cycle at which significant fluorescence is first detected. CT value was converted to copy number using a control plasmid of known concentration. For RSV, 2.61×10⁹ copies had a CT value of 11.897 while for hMPV, 7.51×10⁹ copies had a CT value of 10.51.
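The conversion of a CT value to a copy number can be illustrated with the simplified sketch below. It assumes a single plasmid reference point and a fixed per-cycle amplification efficiency (perfect doubling by default). The experiments above used serial dilutions of the plasmid, so this sketch is not expected to reproduce the tabulated copy numbers exactly, and the function name `copies_from_ct` is hypothetical.

```python
def copies_from_ct(ct, ref_ct, ref_copies, efficiency=1.0):
    """Estimate the starting copy number from a CT value.

    Assumes the product multiplies by (1 + efficiency) in each cycle, so a
    sample crossing the threshold one cycle later than the reference
    started with 1 / (1 + efficiency) as many copies.
    """
    return ref_copies * (1.0 + efficiency) ** (ref_ct - ct)


# Reference point for RSV taken from the text: 2.61e9 copies at CT 11.897.
rsv_estimate = copies_from_ct(21.4366, ref_ct=11.897, ref_copies=2.61e9)
```

Under the doubling assumption, each additional cycle before the threshold is reached halves the estimated starting copy number.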

1-step Diagnostic RT-PCR for Coronavirus and Rhinovirus

Frozen live cultures of human coronavirus OC43, 229E and rhinovirus 16 were purchased from ATCC (Cat #VR-1558, VR-740, VR-283) for use as positive controls. RNA was extracted from these cultures using RNA Mini Kit (Qiagen, Germany) in accordance with manufacturer's instructions. The samples were amplified as previously described using the following diagnostic primer pairs: pancoronavirus (Cor-FW, Cor-RV), OC43 (OC43-FW, OC43-RV), 229E (229E-FW, 229E-RV), rhinovirus (Amplimer 1, Amplimer 2) (Moës et al. 2005; Deffernez et al. 2004).

Analysis of Pathogen Microarray Data

Our Pathogen Microarray contains a set of 40-mer probes P={p₁, p₂, . . . , p_(s)}, binned into distinct probe hybridization signatures for 35 viral genomes V={v₁, v₂, . . . , v₃₅}. Upon hybridization of pathogen nucleic acids, a set of probe signal intensity data D={d₁, d₂, . . . , d_(s)} corresponding to probe set P is generated.

1-Tail T-Test

If virus v_(a) is present, then probes comprising its hybridization signature (probes ∈v_(a)) should have statistically higher signal intensities than probes ∉v_(a), as determined by the t-statistic (1-tail T-test):

$t_{a} = \frac{\mu_{a} - \mu_{a^{\prime}}}{\sqrt{\frac{\sigma_{a}^{2}}{n_{a}} + \frac{\sigma_{a^{\prime}}^{2}}{n_{a^{\prime}}}}}$

where μ_(a), σ_(a)² and n_(a) are the mean, variance, and size of the signal intensities of the probes ∈v_(a) respectively and μ_(a′), σ_(a′)² and n_(a′) are the mean, variance, and size of the signal intensities of the probes ∉v_(a) respectively.

The level of significance was set to 0.05. This means that we would only accept the hypothesis that the mean of the signal intensities of the probes ∈v_(a) is higher than that of the signal intensities of the probes ∉v_(a) if the p-value of t_(a)<0.05. In this case, v_(a) is likely to be present in the sample. However, the T-test method of detection results in many false positive calls.
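The one-tailed test described above can be sketched as follows. The sketch computes the t-statistic from sample variances and, as a simplifying assumption suited to the large probe counts involved, approximates the one-tailed p-value with the standard normal distribution instead of the t distribution. The function name `one_tail_t` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def one_tail_t(signals_in, signals_out):
    """Return (t, p) for the hypothesis mean(signals_in) > mean(signals_out).

    signals_in  -- intensities of the probes in the signature of v_a
    signals_out -- intensities of the probes not in the signature of v_a
    The p-value uses a normal approximation (an assumption of this sketch).
    """
    standard_error = math.sqrt(
        variance(signals_in) / len(signals_in)
        + variance(signals_out) / len(signals_out)
    )
    t = (mean(signals_in) - mean(signals_out)) / standard_error
    p = 1.0 - NormalDist().cdf(t)  # upper-tail probability
    return t, p
```

A virus would be called likely present under this test alone when the returned p-value is below 0.05.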

PDA v.1

PDA v.1 comprises a series of statistical tests, beginning with a Weighted Kullback-Leibler test and Z-score transformation (WKL score) followed by Anderson-Darling test for normality.

Consider the virus v_(a). Let P_(a) be the set of probes of a virus v_(a) and P̄_(a)=P−P_(a). Let [r_(low), r_(high)] be the signal intensity range. We partitioned it into c bins

$\left\lbrack {{r_{low} + {j\left( \frac{r_{high} - r_{low}}{c} \right)}},{r_{low} + {\left( {j + 1} \right)\left( \frac{r_{high} - r_{low}}{c} \right)}}} \right\rbrack$

for j=0, 1, . . . , c−1. The unmodified Kullback-Leibler divergence may be computed by

${{KL}\left( P_{a} \middle| \overset{\_}{P_{a}} \right)} = {\sum\limits_{j = 0}^{c - 1}\; {{f_{a}(j)}{\log \left( \frac{f_{a}(j)}{f_{\overset{\_}{a}}(j)} \right)}}}$

where n_(a)^(j) and n_(ā)^(j) are the number of probes in P_(a) and probes in P̄_(a) contained in the bin b_(j) respectively.

${f_{a}(j)} = \frac{n_{a}^{j}}{\sum\limits_{h = 0}^{c - 1}\; n_{a}^{h}}$

is the fraction of probes in P_(a) found in bin b_(j); and

${f_{\overset{\_}{a}}(j)} = \frac{n_{\overset{\_}{a}}^{j}}{\sum\limits_{h = 0}^{c - 1}\; n_{\overset{\_}{a}}^{h}}$

is the fraction of probes in P̄_(a) found in bin b_(j).

To compare the signal difference of the tail of the probability distribution, we set r_(low)=μ_(ā), the mean signal intensity of the probes in P̄_(a), and r_(high)=maximum signal intensity. We set the default number of bins, c=20.
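The binning and the unmodified divergence above can be sketched as follows. Several details are assumptions made for illustration: probes below r_(low) are ignored, the maximum intensity is placed in the last bin, and a small floor (`floor`) replaces an empty background bin so that the logarithm stays finite. The function names are hypothetical.

```python
import math


def binned_fractions(signals, r_low, r_high, bins=20):
    """Fraction of tail probes (signal >= r_low) in each equal-width bin."""
    counts = [0] * bins
    width = (r_high - r_low) / bins
    for s in signals:
        if s < r_low:
            continue  # outside the tail window
        j = min(int((s - r_low) / width), bins - 1)  # r_high falls in the last bin
        counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def tail_kl(signals_a, signals_rest, bins=20, floor=1e-6):
    """Binned KL divergence of signature probes against the remaining probes."""
    r_low = sum(signals_rest) / len(signals_rest)  # mean of the non-signature probes
    r_high = max(max(signals_a), max(signals_rest))
    if r_high <= r_low:
        return 0.0  # degenerate range, nothing to compare
    f_a = binned_fractions(signals_a, r_low, r_high, bins)
    f_rest = binned_fractions(signals_rest, r_low, r_high, bins)
    return sum(
        fa * math.log(fa / max(fr, floor))
        for fa, fr in zip(f_a, f_rest)
        if fa > 0
    )
```

Restricting the range to values above the background mean focuses the comparison on the upper tail, where a present virus is expected to differ most.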

To further stabilize and/or increase the sensitivity of the Kullback-Leibler divergence on the tail of the probability distribution, two modifications were made. First, we introduced the Anderson-Darling type weight function to the Kullback-Leibler divergence. This gave more weight to the tails than the middle of the distribution. Next, we applied the statistic over the two corresponding cumulative distribution functions instead of their probability density functions. We call our improved Kullback-Leibler divergence the Weighted Kullback-Leibler divergence (WKL score):

${{WKL}\left( P_{a} \middle| \overset{\_}{P_{a}} \right)} = {\sum\limits_{j = 0}^{k - 1}\; \frac{{Q_{a}(j)}{\log \left( \frac{Q_{a}(j)}{Q_{\overset{\_}{a}}(j)} \right)}}{\sqrt{{Q_{\overset{\_}{a}}(j)}\left\lbrack {1 - {Q_{\overset{\_}{a}}(j)}} \right\rbrack}}}$

where Q_(a)(j) is the cumulative distribution function of the signal intensities of the probes in P_(a) found in bin b_(j); Q_(ā)(j) is the cumulative distribution function of the signal intensities of the probes in P̄_(a) found in bin b_(j).

Thus for each hybridized sample, we computed the WKL score of every virus v_(a)∈V. Next, we claimed that the distribution of WKL scores of all viruses v_(a)∈V was approximately normal if there was no virus present in a sample. We empirically verified whether our claim was correct by a bootstrapping process: Let n be the number of viruses in V. For each virus v_(k)∈V where k=1, . . . , n, we choose |v_(k)| probe signal intensities from a real dataset D randomly with replacement to form a “perturbed” signal intensity distribution of v_(k). Such a distribution can mimic the situation where virus v_(k) is not present in the sample D. Thereafter, n WKL scores are generated for the set of n viruses. Next, we checked whether the n WKL scores follow a normal distribution by the Anderson-Darling test for normality at the 95% confidence level. The bootstrap was repeated 100,000 times. The distribution was found to be normal more than 99% of the time. (NB: since there are 35 viral genomes represented on our microarray, n=35)
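One round of the bootstrapping process described above can be sketched as follows. The scoring function is passed in as a parameter (`score_fn`, a hypothetical name) because the sketch does not attempt to reproduce the WKL computation itself. Scoring each resampled signature against the full dataset is also an assumption of this sketch.

```python
import random


def bootstrap_null_scores(data, signature_sizes, score_fn, rng: random.Random):
    """Return one perturbed score per virus, mimicking a sample with no virus.

    data            -- all probe signal intensities of a real dataset D
    signature_sizes -- number of probes |v_k| in each virus signature
    score_fn        -- callable(signature_signals, background_signals) -> float
    rng             -- random.Random instance, so that runs are reproducible
    """
    scores = []
    for size in signature_sizes:
        resampled = rng.choices(data, k=size)  # sampling with replacement
        scores.append(score_fn(resampled, data))
    return scores
```

Repeating this round many times and testing each set of scores for normality gives the fraction of rounds in which the no-virus score distribution appears normal.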

Based on the above discussion, we can test if a sample contains virus(es) by making the following null and alternative hypotheses:

H₀: The distribution of WKL scores is normal, i.e. viruses are not present in the sample.

H₁: The distribution of WKL scores is not normal, i.e. at least 1 virus is present in the sample.

DEFINITION

The Anderson-Darling test is defined as:

H₀: The data follow a specified distribution.

H_(a): The data do not follow the specified distribution.

Test Statistic: The Anderson-Darling test statistic is defined as

$A^{2} = -N - S$

where

$S = \sum\limits_{i = 1}^{N} \frac{(2i - 1)}{N}\left\lbrack \ln F(Y_{i}) + \ln\left( 1 - F(Y_{N + 1 - i}) \right) \right\rbrack$

F is the cumulative distribution function of the specified distribution. Note that the Y_(i) are the ordered data.

Significance Level: α

Critical Region: The critical values for the Anderson-Darling test are dependent on the specific distribution that is being tested. Tabulated values and formulas have been published (Stephens, 1974, 1976, 1977, 1979) for a few specific distributions (normal, lognormal, exponential, Weibull, logistic, extreme value type 1). The test is a one-sided test and the hypothesis that the distribution is of a specific form is rejected if the test statistic, A², is greater than the critical value.
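The test statistic defined above can be computed as in the following sketch for the case of a normal distribution whose mean and standard deviation are estimated from the data. The small-sample adjustment A²(1 + 0.75/N + 2.25/N²) and the 5% critical value of 0.752 are taken from standard published tables for that case. Whether the inventors used exactly this adjustment is not stated, so it should be read as an assumption, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Return (A2, adjusted A2) for a normality test with fitted mean and stdev.

    Requires at least two values that are not all identical.
    """
    y = sorted(values)  # the Y_i are the ordered data
    n = len(y)
    dist = NormalDist(mean(y), stdev(y))
    # Clamp the CDF away from 0 and 1 so that the logarithms stay finite.
    cdf = [min(max(dist.cdf(v), 1e-12), 1 - 1e-12) for v in y]
    s = sum(
        (2 * i - 1) / n * (math.log(cdf[i - 1]) + math.log(1 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    a2 = -n - s
    return a2, a2 * (1 + 0.75 / n + 2.25 / n ** 2)


def looks_normal(values, critical=0.752):
    """True if normality is not rejected at the 5% level (assumed critical value)."""
    return anderson_darling_normal(values)[1] <= critical
```

A set of scores containing one value far above the rest produces a large statistic and is rejected as non-normal, which is the behaviour the detection procedure relies on.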

We proceed to apply the Anderson-Darling test for normality on the distribution of WKL scores to reject H₀ at the 95% confidence level. If the distribution of WKL scores is not normal, then we exclude the virus with the outlying WKL score and apply the Anderson-Darling test again. This process is repeated (to identify the presence of co-infecting pathogens) until H₀ is accepted.

We denote the distribution of WKL scores when H₀ is accepted as the background WKL distribution. The viruses excluded are thus very likely to be present in the sample since their WKL scores do not follow the background WKL distribution.

In our experiments, we observed that P, the probability of a non-normal distribution occurring by random chance with a given WKL score, is very low in samples which contain a virus, i.e. P<1.0×10⁻⁶ (obtained via Z-score transformation of the WKL score). Box 1 shows the pseudo-code for our virus-detection algorithm.

Box 1: Virus detection algorithm

Given a pathogen microarray data D with virus set V and probe set P,

Let V_(present) = ∅

Let D_(WKL) be the set of WKL(P_(v) ∥ P̄_(v)) for all v ∈ V;

1. Determine normality of D_(WKL) with the Anderson-Darling test for normality. If D_(WKL) is a normal distribution with significance level 0.05, return V_(present). Else, go to step 2.

2. Find the virus v_(a) with the highest WKL(P_(a) ∥ P_(a′)) from D_(WKL). Let V_(present) = V_(present) ∪ {v_(a)}; D_(WKL) = D_(WKL) − {WKL(P_(a) ∥ P_(a′))}; go to step 1.

3. Remove detected SPS and verify that WKL distribution is normal.

4. If distribution is not normal, go back to step 2 to find co-infecting pathogen.
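The loop in Box 1 can be sketched in Python as follows. The normality check is passed in as a callable (`is_normal`, a hypothetical name) so that any implementation of the Anderson-Darling test can be supplied. The `min_scores` guard, which stops the loop when too few scores remain to test, is an assumption added so that the sketch always terminates.

```python
def detect_viruses(wkl_scores, is_normal, min_scores=3):
    """Iteratively remove outlying WKL scores until the rest look normal.

    wkl_scores -- dict mapping virus name to its WKL score
    is_normal  -- callable taking a list of scores and returning True or False
    Returns the viruses judged present, in order of removal.
    """
    remaining = dict(wkl_scores)
    present = []
    # Steps 1 and 2 of Box 1: test, remove the top scorer, and test again.
    while len(remaining) >= min_scores and not is_normal(list(remaining.values())):
        top = max(remaining, key=remaining.get)
        present.append(top)
        del remaining[top]
    return present
```

Because the test is repeated after each removal, co-infecting pathogens are picked up in later passes of the same loop.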

Predicting Genome-Wide Amplification Bias

Random primer amplification, rather than primer-specific amplification, is preferred for identifying unknown pathogens in clinical specimens. However, in initial experiments using random priming amplification to identify known pathogens, we frequently observed incomplete hybridizations spanning genomic regions not explained by sequence polymorphisms (FIG. 7C). Genome secondary structure, probe secondary structure and probe GC content also failed to explain these low signal intensity probes. Thus, we hypothesized that incomplete hybridization might be due to PCR bias stemming from differential abilities of the random primers to bind to the viral genome at the reverse transcription (RT) step. The random primer used in our experiments was a 26-mer comprised of a random nonamer (3′) tagged with a fixed 17-mer sequence (5′-GTTTCCCAGTCACGATA) (SEQ ID NO:1) (see also FIG. 1), where the purpose of the fixed 5′ tag was to facilitate PCR of the RT product, generating PCR fragments of less than 10000 bp, in particular 500-1000 bp PCR fragments (Pang et al. 2005; Wang et al. 2002; Wang et al. 2003). To study this phenomenon, we designed an algorithm (AES) to model the RT-PCR process using experimental data. Successful RT-PCR is dependent on the ability of primers to bind to template. Intra-primer secondary structure formation, such as dimer and hairpin formation between the tag and nonamer, and probe melting temperature are known to influence binding efficiency (Nguyen and Southern, 2000; Ratushna et al. 2005).

Assuming that a nonamer in the random primer mix complements the sequence of the viral genome perfectly, the algorithm determines the probability that a 500-1000 bp product can be generated from each possible starting position in the genome. Thus, for every nucleotide in a sliding window of 1000 bases, the probability that it will be successfully amplified is reflected in its Amplification Efficiency Score (AES; see Amplification Efficiency Score above). To validate the algorithm, we ranked the hybridization signal intensities for all 1,948 SPS probes for the RSV genome and compared them to their AES values. Across the RSV genome, we observed that AES correlates remarkably well with hybridization signal intensities (Fisher's Exact Probability Test P=2.2×10⁻¹⁶), demonstrating the strong correlation between AES and probe detection (FIG. 12). Another comparison using 1,705 SPS probes for metapneumovirus showed a similar result, P=1.3×10⁻⁹. The importance of AES in predicting SPS probe detection in clinical samples is demonstrated in FIG. 10. Notably, we observed that higher values of AES correlated with greater proportions of detectable probes, particularly in the top 20% of AES values. Therefore, while HD, MCM, % GC and sequence uniqueness are valuable parameters of probe performance, they do not take into account PCR bias, and thus are insufficient predictors of probe performance when considered in the absence of AES. Using top 20th percentile AES as the first filter in the selection of pathogen SPS significantly improved pathogen prediction as evidenced by higher WKL scores and elimination of false-positive calls (Table 3).

TABLE 3 Detecting pathogens using only mean probe signal intensities (T-test) results in a high number of false-positive calls. Optimized hybridization signatures and removal of probes which cross-hybridize to the human genome (filtered) reduce false-positive calls but are not sufficient for detection accuracy. PDA v.1 is able to make an accurate diagnosis using the entire unfiltered probe set. A virus is “detected” if WKL score >5. Using optimized hybridization signatures (filtered) increases the WKL score, corresponding to increased confidence of the diagnosis.

Detection using PDA v.1

| Chip # | Pathogen | Max WKL score (no filters) | Max WKL score (filtered) | No. of viruses detected | Virus CT value | Virus copy no. |
|---|---|---|---|---|---|---|
| 32272 | Pure SARS | 5.007 | 5.803 | 1 | - | - |
| 34959 | Pure Dengue | 14.351 | 20.373 | 1 | - | - |
| 35259 | RSV patient 324 | 18.288 | 20.611 | 1 | 21.4366 | 9.8 × 10⁷ |
| 35179 | hMPV patient 122 | 1.747 | 8.439 | 1 | 25.5388 | 50384 |
| 35253 | RSV patient 841 | 12.056 | 12.069 | 1 | 20.8619 | 14 × 10⁷ |
| 36042 | RSV patient 412 | 16.466 | 17.531 | 1 | 23.5804 | 2.5 × 10⁷ |
| 36053 | RSV patient 483 | 12.089 | 12.168 | 1 | 24.8340 | 1.2 × 10⁷ |
| 35915 | non-pneumonia patient (negative control) | 3.916 | 4.284 | 0 | 0 | 0 |

Virus CT value: the real-time PCR cycle when virus was detected (see above).

Data for all patient samples hybridized on the array are shown in Table 4 below.

TABLE 4 Complete list of clinical patients hybridized onto pathogen microarrays.

| Array | Patient ID | WKL | P-value | PDA v.1 diagnosis | Clinical diagnosis* | Initial PCR diagnosis | PCR CT value | Virus copy no. | RT-PCR Primer |
|---|---|---|---|---|---|---|---|---|---|
| 35179 | 122 | 8.439216 | 1.34 × 10⁻⁷¹ | hMPV | LRTI | hMPV | 24.8 | 5.0 × 10⁴ | A1 |
| 35887 | 122 | 18.312077 | 2.98 × 10⁻²² | hMPV | LRTI | hMPV | 24.8 | 5.0 × 10⁴ | A2 |
| 71180 | 133 | 17.359597 | 2.42 × 10⁻³⁷ | hMPV | LRTI | hMPV | 25.1159 | 4.0 × 10⁴ | A2 |
| 66691 | 165 | 8.56786 | 1.84 × 10⁻⁴ | hMPV | pneumonia | hMPV | 27.9 | 3.9 × 10³ | A2 |
| 70935 | 254 | 21.348515 | 8.70 × 10⁻³⁰ | hMPV | LRTI | hMPV | 21.9518 | 5.4 × 10⁵ | A2 |
| 63781 | 283 | 16.680752 | 3.97 × 10⁻¹² | hMPV | pneumonia | unknown | | | A2 |
| 73067 | 769 | 24.006323 | 1.34 × 10⁻⁵¹ | hMPV | LRTI | hMPV | 25.6715 | 2.5 × 10⁴ | A2 |
| 66690 | 853 | | | none detected | pneumonia | hMPV | 36 | 0.5 | A2 |
| 68359 | 892 | 12.534284 | 5.66 × 10⁻⁵ | Rhinovirus genus | pneumonia | hMPV | 33.8 | 27 | A2 |
| 35915 | 111 | | | none detected | Negative ctrl | None | | | A1 |
| 70927 | 818 | | | none detected | Negative ctrl | None | | | A2 |
| 66701 | 312 | | | none detected | pneumonia | RSV A | 33.7 | 44 | A2 |
| 71006 | 321 | | | none detected | pneumonia | RSV A | 31.1 | 340 | A2 |
| 66702 | 368 | | | none detected | pneumonia | unknown | | | A2 |
| 71025 | 414 | 25.406289 | 3.80 × 10⁻²⁴ | RSV B | pneumonia | RSV A | 22.3 | 3.9 × 10⁵ | A2 |
| 71027 | 478 | | | none detected | pneumonia | RSV A | 34.8 | 18 | A2 |
| 73068 | 832 | 59.275233 | 1.91 × 10⁻¹⁰² | RSV genus | LRTI | RSV A | 23.7681 | 1.2 × 10⁵ | A2 |
| 71028 | 913 | 25.897084 | 3.23 × 10⁻³⁰ | RSV B | pneumonia | RSV A | 19.1 | 4.7 × 10⁶ | A2 |
| 66703 | 924 | 12.673149 | 9.71 × 10⁻⁶ | RSV genus | pneumonia | RSV A | 31.5 | 250 | A2 |
| 35259 | 324 | 20.61147 | 3.55 × 10⁻⁹⁴ | RSV B | LRTI | RSV B | 21.4366 | 3.0 × 10⁶ | A1 |
| 35662 | 355 | 17.999418 | 2.97 × 10⁻⁴⁰ | RSV B | LRTI | RSV B | 20.2642 | 6.7 × 10⁶ | A1 |
| 66695 | 374 | | | none detected | pneumonia | RSV B | 34.1 | 500 | A2 |
| 70933 | 378 | 13.81578 | 7.77 × 10⁻¹⁷ | RSV B | LRTI | RSV B | 23.9204 | 5.4 × 10⁵ | A2 |
| 36042 | 412 | 17.531234 | 4.58 × 10⁻⁵⁵ | RSV B | LRTI | RSV B | 23.5804 | 6.9 × 10⁵ | A1 |
| 35890 | 412 | 17.214556 | 1.05 × 10⁻⁴³ | RSV B | LRTI | RSV B | 23.5804 | 6.9 × 10⁵ | A2 + A3 |
| 36053 | 483 | 12.168025 | 1.47 × 10⁻¹² | RSV B | LRTI | RSV B | 24.834 | 2.9 × 10⁵ | A1 |
| 70997 | 554 | 76.547183 | 1.83 × 10⁻¹¹⁹ | Rhinovirus genus; | pneumonia | RSV B | 35.1 | 240 | A2 |
| | | 54.013223 | 2.45 × 10⁻⁶¹ | Enteroviridae family | | | | | |
| 35253 | 841 | 12.069138 | 4.86 × 10⁻²⁶ | RSV B | pneumonia | RSV B | 20.8619 | 4.4 × 10⁶ | A1 |
| 73070 | 841 | 22.10857 | 6.80 × 10⁻⁵⁰ | RSV B, | pneumonia | RSV B/ | 20.8619 | 4.4 × 10⁶ | A2 |
| | | 5.708560 | 5.66 × 10⁻⁶ | hMPV coinfection | | hMPV | 35.4 | 8 | |
| 68360 | 841 | 21.369516 | 2.09 × 10⁻²⁵ | RSV B, | pneumonia | RSV B/ | 20.8619 | 4.4 × 10⁶ | A2 |
| | | 9.647188 | 1.23 × 10⁻⁸ | hMPV coinfection | | hMPV | 35.4 | 8 | |
| 66696 | 185 | | | none detected | pneumonia | unknown | | | A2 |
| 66697 | 261 | | | none detected | pneumonia | unknown | | | A2 |
| 66698 | 331 | | | none detected | pneumonia | unknown | | | A2 |
| 71189 | 393 | | | none detected | pneumonia | unknown | | | A2 |
| 66699 | 461 | | | none detected | pneumonia | unknown | | | A2 |
| 66700 | 573 | 41.397051 | 3.97 × 10⁻²³ | Rhinovirus genus; | pneumonia | unknown | | | A2 |
| | | 27.444893 | 1.34 × 10⁻¹¹ | Enteroviridae family | | | | | |
| 71182 | 639 | | | none detected | pneumonia | unknown | | | A2 |
| 71007 | 699 | | | none detected | pneumonia | unknown | | | A2 |
| 71188 | 859 | | | none detected | pneumonia | unknown | | | A2 |

*LRTI: lower respiratory tract infection

The importance of AES suggested that amplification efficiency and subsequent probe detection could be improved by using optimized RT-PCR primer tags. Thus, we calculated AES scores using randomly generated 17-mer tag sequences, and selected the top 3 most divergent primers which resulted in the greatest overall increase in AES scores (FIG. 13). Using the AES optimized primers, we amplified metapneumovirus and RSV from clinical samples with improved PCR efficiency and detection sensitivity (FIG. 14, Table 5).

TABLE 5 Comparison of E-Predict and PDA v.1 algorithms on patient samples #412 and #122. Array 35179 was amplified using the original PCR primer described in Results. Arrays 36731 and 35887 were amplified using primer A2, and Array 35890 was amplified using both primers A2 and A3. PDA v.1 returned only the correct pathogen in all cases. The authors of E-Predict use P < 0.01 as significance cutoff on their platform (Urisman et al. 2005). A lower cutoff appears to be necessary if this algorithm is used to analyze our array data. The new primers designed by PCR modeling result in better prediction scores using either algorithm (arrays 35179 vs 35887). Having a second primer during the PCR process offered incremental improvement in WKL scores and P-values (arrays 36731 vs 35890).

| Array | Patient | PCR amplification primers | E-Predict algorithm: Genome | E-Predict algorithm: Similarity_Score | E-Predict algorithm: P-value | GISPathogen algorithm: Genome | GISPathogen algorithm: WKL |
|---|---|---|---|---|---|---|---|
| 36042 | 412 (RSV) | Original primer A1 | RSV | 0.35128 | 0 | RSV | 21.526316 |
| | | | OC43 coronavirus | 0.350264 | 6.84E−20 | | |
| | | | 229E coronavirus | 0.323503 | 1.77E−10 | | |
| | | | Hepatitis B | 0.134825 | 3.03E−04 | | |
| | | | SARS coronavirus | 0.338911 | 0.00299 | | |
| | | | Hepatitis A | 0.229589 | 0.00847 | | |
| 36731 | 412 (RSV) | A2 | RSV | 0.335389 | 0 | RSV | 21.836754 |
| | | | OC43 coronavirus | 0.348043 | 2.29E−13 | | |
| | | | 229E coronavirus | 0.322055 | 2.00E−09 | | |
| | | | Hepatitis B | 0.135222 | 1.02E−06 | | |
| | | | Rubella | 0.164332 | 0.00919 | | |
| 35890 | 412 (RSV) | A2 + A3 | RSV | 0.334602 | 0 | RSV | 22.093258 |
| | | | OC43 coronavirus | 0.348969 | 3.63E−23 | | |
| | | | 229E coronavirus | 0.322805 | 3.20E−14 | | |
| | | | Hepatitis B | 0.13436 | 6.74E−04 | | |
| | | | SARS coronavirus | 0.338609 | 0.03060 | | |
| 35179 | 122 (hMPV) | Original primer A1 | hMPV | 0.260110695 | 5.01E−28 | hMPV | 9.763149 |
| | | | Rubella | 0.164784981 | 1.20E−17 | | |
| | | | Foot-and-mouth C | 0.206747816 | 4.66E−11 | | |
| | | | Jap encephalitis | 0.201347222 | 1.65E−04 | | |
| | | | Hepatitis B | 0.133407622 | 1.98E−04 | | |
| | | | Yellow Fever | 0.200500564 | 0.00567 | | |
| | | | Echovirus 1 | 0.222002025 | 0.01740 | | |
| | | | Newcastle | 0.234481686 | 0.01820 | | |
| 35887 | 122 (hMPV) | A2 | hMPV | 0.299655 | 0 | hMPV | 39.677149 |
| | | | Rubella | 0.169626 | 3.40E−19 | | |
| | | | Hepatitis B | 0.137703 | 5.84E−12 | | |
| | | | OC43 coronavirus | 0.347685 | 5.06E−10 | | |
| | | | 229E coronavirus | 0.321702 | 1.72E−06 | | |
| | | | SARS coronavirus | 0.340504 | 1.76E−06 | | |
| | | | Foot-and-mouth C | 0.2075 | 1.31E−04 | | |
| | | | Newcastle | 0.23453 | 0.04310 | | |

PDA v.1—an Algorithm for Detecting Pathogens

Clinical specimens are often sub-optimal for genomic amplification: they may have low viral titres, have sequence polymorphisms from the reference strain on the array, or have co-infecting pathogens. Microarrays also have an inherent noise from non-specific hybridization and other artifacts. Thus, interpreting microarray data is not a simple matter of matching probe signal intensity profiles to the SPS, or using simple statistical methods (e.g. T-test, ANOVA, and the like). To address this issue, we established a robust statistical software, PDA v.1, which analyzes the distribution of probe signal intensities relative to the in silico predicted SPS to identify pathogens present in a hybridized sample (see above).

We observed that, while the signal intensities for all probes on the array would fall in a normal distribution, a large proportion of probes comprising the SPS of a pathogen which is present in the sample would have very strong signal intensities, resulting in a distribution skewed to the right. From this we deduced that we could detect the presence of pathogens by analyzing the distribution of probe signal intensities (FIG. 9A). Examining the tails of the signal intensity distributions for each SPS would also enable us to identify the presence of co-infecting pathogens in the sample.

Thus, PDA v.1 comprises 2 parts: (1) Weighted Kullback-Leibler Divergence (WKL; our enhanced Kullback-Leibler test) to evaluate the probe signal intensity of probes in each pathogen SPS, and (2) an Anderson-Darling test to determine if the distribution of WKL scores for each SPS is normal.

The original Kullback-Leibler divergence cannot reliably determine differences in the tails of a probability distribution, and is highly dependent on the number of probes/genome and the size of each signal intensity bin (Kullback and Leibler, 1951). We overcame these deficits by incorporating the Anderson-Darling statistic to give more weight to the tails of each distribution, and by using a cumulative distribution function instead of the original probability distribution (Anderson and Darling, 1952). We call our enhanced KL divergence the Weighted Kullback-Leibler divergence (WKL):

${{WKL}\left( P_{a} \middle| \overset{\_}{P_{a}} \right)} = {\sum\limits_{j = 0}^{k - 1}\; \frac{{Q_{a}(j)}{\log \left( \frac{Q_{a}(j)}{Q_{\overset{\_}{a}}(j)} \right)}}{\sqrt{{Q_{\overset{\_}{a}}(j)}\left\lbrack {1 - {Q_{\overset{\_}{a}}(j)}} \right\rbrack}}}$

where Q_(a)(j) is the cumulative distribution function of the signal intensities of the probes in P_(a) found in bin b_(j); Q_(ā)(j) is the cumulative distribution function of the signal intensities of the probes in P̄_(a) found in bin b_(j). SPS representing absent pathogens should have normal signal intensity distributions and thus relatively low WKL scores, whereas those representing present pathogens should have high, statistically significant outlying WKL scores (FIG. 9B). In the second part of PDA v.1, the distribution of WKL scores is subjected to an Anderson-Darling test for normality. If P<0.05, the WKL distribution is considered not normal, implying that the pathogen with the outlying WKL score is present. Upon identification of a pathogen, a separate Anderson-Darling test is performed in the absence of its WKL score to test for the presence of co-infecting pathogens. In this manner, the procedure is iteratively applied until only normal distributions remain (i.e., P>0.05; see Table 3 and Table 4 above). PDA v.1 is extremely fast, capable of making a diagnosis from a hybridized microarray in about 10 seconds.

Pathogen Diagnosis on 33 Clinical Patient Samples

We evaluated our platform by hybridizing 33 clinical specimens onto our pathogen microarray platform, according to the workflow illustrated in FIG. 11. Of these, 27 specimens had been previously diagnosed as RSV A, RSV B or metapneumovirus. Our platform accurately detected pathogens from 21/27 samples. The 6 samples where no virus was detected (false-negative) were at the detection limit by real-time PCR (<10 viral copies/reaction), and such low viral loads were unlikely to be the etiologic agent responsible for the patient's severe disease. 2 of these were correctly diagnosed by microarray to be infected with rhinovirus. In a screen of another 6 patients with severe respiratory disease caused by unknown pathogen, the microarray identified the etiologic agent (rhinovirus) in 1 of the samples (Table 4 above). These results were validated by real-time PCR. As expected, we did not detect any pathogens when we hybridized samples extracted from pneumonia patients with non-viral etiology.

Data Analysis

Microarrays were scanned at 5 μm resolution using the Axon 4000b scanner and Genepix 4 software (Axon Instruments). Signal intensities were extracted using Nimblescan 2.1 software (NimbleGen Systems). Using an automated script, we calculated the median signal intensity (to eliminate hybridization artifacts) and standard deviation from the 7 replicates of each probe. The probe signal intensities were sorted by genome and arranged in sequence order, then reformatted into CDT format for graphical viewing of signal intensities in Java Treeview (http://jtreeview.sourceforge.net). In parallel, the probe median signal intensities were analysed using PDA v.1 to determine which pathogen is present, and the associated confidence level of prediction. The present inventors carried out experiments to demonstrate the effects of probe design on experimental results and then to show the robustness of the analysis algorithm according to the present invention.
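The per-probe summary described above can be sketched as follows. The input layout (a mapping from probe identifier to its replicate intensities) and the function name `summarize_replicates` are assumptions made for illustration and do not describe the automated script actually used.

```python
from statistics import median, stdev


def summarize_replicates(replicate_signals):
    """Map each probe ID to (median, standard deviation) of its replicates.

    replicate_signals -- dict of probe ID -> list of replicate intensities
                         (7 per probe on the array described above)
    """
    return {
        probe: (median(values), stdev(values))
        for probe, values in replicate_signals.items()
    }
```

The median is insensitive to a single aberrant replicate, which is why it suits the removal of hybridization artifacts, while a large standard deviation flags probes whose replicates disagree.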

Effects of Probe Design on Experimental Results

A PDC containing 53555 40-mer probes from 35 viruses affecting humans was used for 4 independent microarray experiments. These 53555 probes were chosen based on a 5-bp tiling of each virus and were not subjected to any of our probe design criteria. Thus, we would expect errors arising due to GC-content, cross-hybridization and inefficient amplification to be significantly more than that of a PDC with well-designed probes. We tested our analysis algorithm in such an adverse setting for 4 experiments.

In this example, a human sample with an unknown pathogen was amplified by the RT-PCR process using random probes and then hybridized onto the PDC. We subjected the probes for each of the 35 viruses on our PDC to the one-tailed t-test with significance level 0.05 and computed the Weighted Kullback-Leibler (WKL) divergence of their signal intensities to the signal intensities of all the probes on the chip to determine which virus was in the sample for each experiment. Confirmation of the accuracy of the analysis by our program was done by wet-lab PCR to identify the actual virus in the sample. We present the results of our analysis for the 4 experiments, together with their corresponding PCR verifications, in Table 6.

TABLE 6 Analysis results done on a PDC with no probe design criteria applied.

The virus determined by our analysis algorithm to be the actual virus in the sample tested for each experiment is highlighted in light gray colour.

The present results show that the analysis algorithm accurately deduces the actual virus in the sample tested in the first 3 experiments (results shown in Table 6 above). Furthermore, we were able to deduce that the sample in the last experiment contained no viruses. Note that if we had used the t-test alone with a significance level of 0.05, the number of viruses detected as present in each sample would have been as shown in Table 7 below.

TABLE 7 False positive detection of viruses using t-test alone

Sample Name                                               35259_324   35179_122   35253_841   35915_111
Viruses Detected Using T-test                             9           14          9           10
False Positives                                           8           13          8           10
Max KL divergence (>5.0)                                  16.391      5.76        10.85       -
Viruses Detected Using T-test followed by KL divergence   1           1           1           0

By using the Weighted Kullback-Leibler divergence of the viruses that pass the t-test, we were able to remove all false positive viruses and identify the actual virus. Thus, our analysis algorithm can robustly determine the virus under a high level of noise.
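
The two-stage decision described here (t-test first, then a WKL cut-off) can be sketched as a simple filter. The p-values and WKL scores are taken as given inputs, the 0.05 and 5.0 thresholds are those quoted in this document, and the function name, virus names and numbers are hypothetical.

```python
def call_viruses(stats_by_virus, alpha=0.05, wkl_cutoff=5.0):
    """stats_by_virus maps a virus name to (one-tailed p-value, WKL score)."""
    # Stage 1: keep only viruses whose probes pass the t-test.
    passed_t = {v: score for v, (p, score) in stats_by_virus.items() if p < alpha}
    # Stage 2: among those, keep only viruses whose WKL score exceeds the cut-off.
    called = [v for v, score in passed_t.items() if score > wkl_cutoff]
    return sorted(called, key=passed_t.get, reverse=True)

# Hypothetical sample: three viruses pass the t-test, only one survives the WKL cut-off.
example = {
    "virus_A": (0.0001, 16.4),
    "virus_B": (0.01, 2.1),
    "virus_C": (0.03, 0.7),
    "virus_D": (0.40, 0.1),
}
print(call_viruses(example))
```

In this toy input the t-test alone would report three viruses; the second stage reduces the call to one.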

Next, we investigated the effects of using a PDC with probe design criteria applied on our analysis results. Firstly, the amplification efficiency map for each of the 35 viruses was computed. Then, the same 53555 probes on the original PDC were subjected to the probe design criteria. Probes which had extreme levels of CG-content, high similarity to human and non-target viruses, or low amplification efficiency scores were removed from the chip. A total of 10955 probes were retained for the second set of experiments. Using the same samples as in the first set of experiments, we repeated the 4 experiments with the new chip. The experimental results are presented in Table 8 below.
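
The filtering step can be sketched as a per-probe predicate. The numerical bounds on CG-content and on the amplification efficiency score are hypothetical placeholders, since this section does not state them; the Hamming distance (HD) value of 12 is the familial cross-hybridization threshold quoted later in this document, and treating it as a strict lower bound is an assumption. The similarity and efficiency values are taken as precomputed inputs.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def keep_probe(seq, aes, min_hd_to_nontarget,
               gc_bounds=(0.4, 0.6),  # hypothetical bounds for "extreme" CG-content
               min_aes=0.5,           # hypothetical amplification efficiency cut-off
               hd_threshold=12):      # HD threshold quoted in this document
    """Return True if a probe passes all three design criteria."""
    low, high = gc_bounds
    if not (low <= gc_fraction(seq) <= high):
        return False  # extreme CG-content
    if min_hd_to_nontarget <= hd_threshold:
        return False  # too similar to human or non-target viral sequence
    return aes >= min_aes  # low amplification efficiency is rejected

balanced = "ATGC" * 10
gc_rich = "GC" * 20
print(keep_probe(balanced, aes=0.9, min_hd_to_nontarget=20),
      keep_probe(gc_rich, aes=0.9, min_hd_to_nontarget=20))
```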

TABLE 8 Analysis results done on a PDC with probe design criteria applied.

The virus determined by our analysis algorithm to be the actual virus in the sample tested for each experiment is highlighted in light gray colour.

In this second set of experiments, the analysis algorithm correctly detected the actual virus in the 3 virus-containing samples and also identified the negative sample. After designing good probes for our chip, the Weighted Kullback-Leibler divergence of the actual viruses in Experiments 1, 2 and 3 was greater than that of the corresponding experiments without probe design. This means that the signal intensities from the actual virus were higher relative to the background noise in the PDC. This showed that our probe design criteria had removed some bad probes from the PDC, which resulted in a more accurate analysis.

Again, Table 9 below shows the number of viruses that would have been detected as present in each sample had we used the t-test alone with a significance level of 0.05.

TABLE 9 False positive detection of viruses using t-test alone in a PDC with probe design.

Sample Name                                               35259_324   35179_122   35253_841   35915_111
Viruses Detected Using T-test                             6           9           9           10
False Positives                                           5           8           8           10
Max KL divergence (>5.0)                                  18.54859    9.324785    11.17914    -
Viruses Detected Using T-test followed by KL divergence   1           1           1           0

From Table 9, it can be seen that probe design has reduced the number of false positive viruses detected by the t-test for samples 35259_324 and 35179_122. A more important observation is that the Weighted Kullback-Leibler divergence for the actual virus has increased for all 3 virus-positive samples. This means that the signals of the actual virus are more clearly differentiated from the background signals when probe design criteria are applied on the PDC.
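
The comparison between Table 7 and Table 9 can be checked with a few lines of arithmetic. The numbers below are copied from those two tables; the variable names are illustrative, and the negative sample 35915_111 has no reported divergence in either table, so it is omitted from the divergence comparison.

```python
# Max KL divergence of the actual virus, without and with probe design.
wkl_no_design = {"35259_324": 16.391, "35179_122": 5.76, "35253_841": 10.85}
wkl_design = {"35259_324": 18.54859, "35179_122": 9.324785, "35253_841": 11.17914}

# False positives from the t-test alone, without and with probe design.
fp_no_design = {"35259_324": 8, "35179_122": 13, "35253_841": 8, "35915_111": 10}
fp_design = {"35259_324": 5, "35179_122": 8, "35253_841": 8, "35915_111": 10}

wkl_gain = {s: wkl_design[s] - wkl_no_design[s] for s in wkl_no_design}
fp_drop = {s: fp_no_design[s] - fp_design[s] for s in fp_no_design}

for sample in fp_no_design:
    gain = wkl_gain.get(sample)
    gain_text = "n/a" if gain is None else f"{gain:+.3f}"
    print(f"{sample}: WKL change {gain_text}, false positives removed {fp_drop[sample]}")
```

The divergence rises for each of the three samples with a reported value, while the t-test false-positive count falls only for the first two samples.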

In conclusion, we showed that using the one-tailed t-test with significance level 0.05, followed by computing the Weighted Kullback-Leibler divergence for the signal intensities of each virus, we were able to accurately analyze the data on the PDC and determine with high probability the actual pathogen in the sample. Although the analysis algorithm works well even under a high level of noise, we showed that the accuracy of the analysis is improved by using the above-described probe design criteria to select a good set of probes for the PDC.

Alternative Methods for Probe Design and Pathogen Detection

Very few algorithms are available for predicting cross-hybridization on microarrays, and only 1 algorithm, E-Predict, has been reported and validated for detecting pathogens on microarrays (Urisman et al. 2005; Li et al. 2005). E-Predict matches hybridization signatures with predicted signatures derived from the theoretical free energy of hybridization for each microarray probe. However, using E-Predict to analyze our microarrays resulted in a number of false positive calls (see Table 5 above). For example, E-Predict detected coronavirus in RSV patient 412 (FIG. 15). Diagnostic PCR using pancoronavirus primers as well as specific diagnostic primers for OC43 and 229E coronavirus confirmed the absence of coronavirus from patient 412 (see Table 4 above). We hypothesized that the false positive calls by E-Predict resulted from coronavirus probes which cross-hybridized with the human or RSV genomes. Indeed, 85% of the 50 coronavirus probes with the highest signal intensity were predicted to cross-hybridize with the human genome and 65% had HD<17 relative to RSV, which is just above our HD threshold of 12 for familial cross-hybridization. Furthermore, E-Predict was optimized to work on a microarray containing probes that are highly conserved among viral genome regions, rather than on tiling arrays where cross-hybridization to the human genome would be a key consideration. Thus it is likely that these 2 factors, a different microarray design strategy and cross-hybridization to the human genome, contributed to the poor performance of E-Predict on our platform. From our experience with E-Predict, it would not be fair for us to compare PDA v.1 with the other algorithms, as they were designed for different probe lengths and optimized for other applications and platforms.
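
The HD comparison used in this paragraph can be illustrated with a minimal sketch. It assumes that the HD of a probe relative to a genome is the smallest Hamming distance (Hamming 1950) between the probe and any same-length window of that genome on a single strand; strand handling and any indexing used in practice are not addressed, and the function names and toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_hamming(probe, genome):
    """Smallest Hamming distance between a probe and any window of a genome."""
    n = len(probe)
    if len(genome) < n:
        raise ValueError("genome is shorter than the probe")
    return min(hamming(probe, genome[i:i + n])
               for i in range(len(genome) - n + 1))

# Toy sequences: the probe occurs in the second genome with one mismatch.
print(min_hamming("ACGT", "TTACGTTT"), min_hamming("ACGT", "TTACCTTT"))
```

A probe whose minimum HD to a non-target genome falls at or below the chosen threshold would be flagged as a likely cross-hybridizer under this reading.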

CONCLUSION

By empirically determining cross-hybridization thresholds, we created in silico pathogen signature probe sets comprising only probes which would hybridize well to specific viruses present in clinical samples. The AES algorithm allowed us to design universal primer tags to efficiently amplify entire viral genomes. Together with the PDA v.1 detection algorithm, we can confidently identify any of the pathogens represented on the microarray from clinical samples. This approach eliminates the requirement for empirical validation of each pathogen hybridization signature and allows for future microarrays containing probes for >10000 pathogens to become powerful diagnostic platforms for pathogen identification.

We have optimized the design and analysis for pathogen detection microarrays, facilitating their use in a hospital setting. We discovered that primer tags routinely used in random PCR are biased, resulting in non-uniform amplification of pathogen genomes. This bias can be avoided by designing primers using our AES algorithm. Our in silico signature probe sets allow us to predict accurately which probes would hybridize to any pathogen represented on the array. Together with the PDA v.1 detection algorithm, this approach eliminates the requirement for empirical validation of each pathogen hybridization signature and allows for future microarrays containing probes for >10000 pathogens to become powerful diagnostic platforms for pathogen identification.

Here, we report the results of a systematic investigation of the complex relationships between viral amplification efficiency, hybridization signal output, target-probe annealing specificity, and reproducibility of pathogen detection using a custom designed microarray platform. Our findings form the basis of a novel methodology for the in silico prediction of optimal pathogen signature probe sets (SPS), shed light on the factors governing viral amplification efficiency (prior to microarray hybridization) and demonstrate the important connection between a viral amplification efficiency score (AES) and optimal probe selection. Finally, we describe a new statistics-based pathogen detection algorithm (PDA) that can rapidly and reproducibly identify pathogens in clinical specimens across a range of viral titers.

We have demonstrated the feasibility of using viral genome sequence obtained from publicly available databases to detect viruses in clinical samples with a high degree of certainty if at least 4000 virus copies are present (see Table 3 above). The sensitivity of this method approaches that of antigen detection methods, making it a clinically relevant detection tool (Liu et al. 2005; Marra et al. 2003). The ability to predict in silico pathogen hybridization signatures accurately presents a significant advance over current microarray methods, which require empirical validation by first hybridizing the array with pure pathogen samples. Besides specific identification of pathogens represented on the array, PDA v.1 allows identification of the pathogen class, family or genus for those genomes which are not specifically represented on the array (by relaxing thresholds for HD and MCM). This information is often sufficient for treatment decisions in the clinic. With an AES-optimized tag, we were able to identify virus from clinical samples which could not be detected earlier when amplified using a non-AES-optimized tag. Thus selection of tags by AES increased PCR efficiency and sensitivity of detection. The algorithm according to the invention may be applied to other tag-based PCR applications, such as generation of DNA libraries and enrichment of RNA for resequencing.

REFERENCES

-   Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
-   Anderson T W, Darling D A (1952) Asymptotic theory of certain goodness of fit criteria based on stochastic processes. Annals of Mathematical Statistics 23: 192-212.
-   Bodrossy L, Sessitsch A (2004) Oligonucleotide microarrays in microbial diagnostics. Curr Opin Microbiol 7: 245-254.
-   Bohlander S K, Espinosa I, Rafael, Le Beau M M, Rowley J D, Diaz M O (1992) A method for the rapid sequence-independent amplification of microdissected chromosomal material. Genomics 13: 1322-1324.
-   Bustin S A, Nolan T (2004) Pitfalls of quantitative real-time reverse-transcription polymerase chain reaction. J Biomol Tech 15: 155-166.
-   Deffernez C, Wunderli W, Thomas Y, Yerly S, Perrin L, et al. (2004) Amplicon Sequencing and Improved Detection of Human Rhinovirus in Respiratory Samples. J Clin Microbiol 42: 3212-3218. doi:10.1128/JCM.42.7.3212-3218.2004.
-   Fu J, Tan B H, Yap E H, Chan Y C, Tan Y H (1992) Full-length cDNA sequence of dengue type 1 virus (Singapore strain S275/90). Virology 188: 953-958.
-   Goulden C H (1956) Methods of Statistical Analysis, 2nd Edn. John Wiley & Sons, Inc., New York.
-   Hamming R W (1950) Error Detecting and Error Correcting Codes. Bell System Technical Journal 29: 147-160.
-   International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921.
-   Kane M D, Jatkoe T A, Stumpf C R, Lu J, Thomas J D, et al. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res 28: 4552-4557.
-   Ksiazek T G, Erdman D, Goldsmith C S, Zaki S R, Peret T, et al. (2003) A novel coronavirus associated with severe acute respiratory syndrome. N Engl J Med 348: 1953-1966.
-   Kullback S, Leibler R A (1951) On information and sufficiency. Annals of Mathematical Statistics 22: 79-86.
-   Li X, He Z, Zhou J (2005) Selection of optimal oligonucleotide probes for microarrays using multiple criteria, global alignment and parameter estimation. Nucl Acids Res 33: 6114-6123.
-   Liu J, Lim S L, Ruan Y, Ling A E, Ng L F, et al. (2005) SARS transmission pattern in Singapore reassessed by viral sequence variation analysis. PLoS Med 2(2): 162-168.
-   Marra M A, Jones S J, Astell C R, Holt R A, Brooks-Wilson A, et al. (2003) The Genome sequence of the SARS-associated coronavirus. Science 300: 1399-1404.
-   Maskos U, Southern E M (1993) A study of oligonucleotide reassociation using large arrays of oligonucleotides synthesised on a glass support. Nucleic Acids Res 21: 4663-4669.
-   Moës E, Vijgen L, Keyaerts E, Zlateva K, Li S, et al. (2005) A novel pancoronavirus RT-PCR assay: frequent detection of human coronavirus NL63 in children hospitalized with respiratory tract infections in Belgium. BMC Infect Dis 5: 6.
-   Nguyen H K, Southern E M (2000) Minimising the secondary structure of DNA targets by incorporation of a modified deoxynucleoside: implications for nucleic acid analysis by hybridisation. Nucleic Acids Res 28: 3904-3909.
-   Nuwaysir E F, Huang W, Albert T J, Singh J, Nuwaysir K, et al. (2002) Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res 12: 1749-1755.
-   Pang X L, Preiksaitis J K, Lee B (2005) Multiplex real time RT-PCR for the detection and quantitation of norovirus genogroups I and II in patients with acute gastroenteritis. J Clin Virol 33: 168-171.
-   Ratushna V G, Weller J W, Gibas C J (2005) Secondary structure in the target as a confounding factor in synthetic oligomer microarray design. BMC Genomics 6: 31.
-   Ruan Y J, Wei C L, Ee A L, Vega V B, Thoreau H, et al. (2003) Comparative full-length genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection. Lancet 361: 1779-1785.
-   Sambrook and Russell (2001) Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory, New York.
-   SantaLucia J Jr, Allawi H T, Seneviratne P A (1996) Improved nearest-neighbor parameters for predicting DNA duplex stability. Biochemistry 35: 3555-3562.
-   Smalling T W, Sefers S E, Li H, Tang Y W (2002) Molecular approaches to detecting herpes simplex virus and enteroviruses in the central nervous system. J Clin Microbiol 40: 2317-2322.
-   Stephens M A (1974) EDF Statistics for Goodness of Fit and Some Comparisons. Journal of the American Statistical Association 69: 730-737.
-   Striebel H M, Birch-Hirschfeld E, Egerer R, Foldes-Papp Z (2003) Virus diagnostics on microarrays. Curr Pharm Biotechnol 4: 401-415.
-   Sung W K, Lee W H (2003) Fast and Accurate Probe Selection Algorithm for Large Genomes. CSB.
-   Sung W K, Lee W H (2003) in IEEE Computational Systems Bioinformatics Conference, Stanford University, Stanford, Calif.
-   Urisman A, Fischer K F, Chiu C Y, Kistler A L, Beck S, et al. (2005) E-Predict: a computational strategy for species identification based on observed DNA microarray hybridization patterns. Genome Biol 6: R78.
-   Vega V B, Ruan Y, Liu J, Lee W H, Wei C L, et al. (2004) Mutational dynamics of the SARS coronavirus in cell culture and human populations isolated in 2003. BMC Infect Dis 4: 32.
-   Vora G J, Meador C E, Stenger D A, Andreadis J D (2004) Nucleic acid amplification strategies for DNA microarray-based pathogen detection. Appl Environ Microbiol 70: 3047-3054.
-   Wang D, Coscoy L, Zylberberg M, Avila P C, Boushey H A, et al. (2002) Microarray-based detection and genotyping of viral pathogens. Proc Natl Acad Sci USA 99: 15687-15692.
-   Wang D, Urisman A, Liu Y T, Springer M, Ksiazek T G, et al. (2003) Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 1: E2.
-   Wong C W, Albert T J, Vega V B, Norton J E, Cutler D J, et al. (2004) Tracking the Evolution of the SARS Coronavirus Using High-Throughput, High-Density Resequencing Arrays. Genome Res 14: 398-405.
-   Wu D Y, Ugozzoli L, Pal B K, Qian J, Wallace R B (1991) The effect of temperature and oligonucleotide primer length on the specificity and efficiency of amplification by the polymerase chain reaction. DNA Cell Biol 10: 233-238.

1. A method of detecting at least one target nucleic acid comprising the steps of: (i) providing at least one biological sample; (ii) amplifying nucleic acid(s) comprised in the biological sample; (iii) providing at least one oligonucleotide capable of hybridizing to at least one target nucleic acid, if present in the biological sample; and (iv) contacting the oligonucleotide(s) with the amplified nucleic acids and/or detecting the oligonucleotide(s) hybridized to the target nucleic acid(s).
2. The method according to claim 1, wherein the target nucleic acid to be detected is nucleic acid exogenous to the nucleic acid of the biological sample.
3. The method according to claim 1, wherein in the detection step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.
4. The method according to claim 1, wherein in the detection step (iv), the mean of the signal intensities of the probes which hybridize to v_(a) is statistically higher than the mean of the probes ∉v_(a), and the method further comprises the step of computing the relative difference of the proportion of probes ∉v_(a) having high signal intensities to the proportion of the probes used in the detection method having high signal intensities, the density distribution of the signal intensities of probes v_(a) being more positively skewed than that of probes ∉v_(a), thereby indicating the presence of v_(a) in the biological sample.
5. The method according to claim 1, wherein in the detection step (iv), the presence of at least one target nucleic acid in a biological sample is given by a value of Weighted Kullback-Leibler divergence score of 1.0.
6. The method according to claim 1, wherein the detection step (iv) comprises evaluating the signal intensity of probe(s) in each signature probe set (SPS) for the target nucleic acid(s) v_(a) by calculating the distribution of Weighted Kullback-Leibler (WKL) divergence scores:
$\mathrm{WKL}\left(P_{a} \middle| \overline{P_{a}}\right) = \sum_{j=0}^{k-1} \frac{Q_{a}(j)\,\log\left(\frac{Q_{a}(j)}{Q_{\bar{a}}(j)}\right)}{\sqrt{Q_{\bar{a}}(j)\left[1 - Q_{\bar{a}}(j)\right]}}$
where Q_(a)(j) is the cumulative distribution function of the signal intensities of the probes in P_(a) found in bin b_(j); Q_(ā)(j) is the cumulative distribution function of the signal intensities of the probes in $\overline{P_{a}}$ found in bin b_(j); P_(a) is the set of probes of a virus v_(a); and $\overline{P_{a}} = P - P_{a}$.
7. The method according to claim 6, wherein each signature probe set (SPS) which represents the absence of target nucleic acid(s) v_(a) has a Weighted Kullback-Leibler (WKL) divergence score of WKL<5, and wherein each signature probe set (SPS) which represents the presence of at least one target nucleic acid v_(a) has a Weighted Kullback-Leibler (WKL) divergence score of WKL>5.
8. The method according to claim 6, further comprising performing Anderson-Darling test on the distribution of WKL score(s), wherein a result of P>0.05 thereby indicates the absence of target nucleic acid(s) v_(a) and wherein a result of P<0.05 thereby indicates the presence of target nucleic acid(s) v_(a).
9. An apparatus configured to perform a method of detecting at least one target nucleic acid comprising the steps of: (i) providing at least one biological sample; (ii) amplifying nucleic acid(s) comprised in the biological sample; (iii) providing at least one oligonucleotide capable of hybridizing to at least one target nucleic acid, if present in the biological sample; and (iv) contacting the oligonucleotide(s) with the amplified nucleic acids and/or detecting the oligonucleotide(s) hybridized to the target nucleic acid(s); wherein the detection step comprises evaluating the signal intensity of probe(s) in each signature probe set (SPS) for the target nucleic acid(s) by calculating the distribution of Weighted Kullback-Leibler (WKL) divergence scores:
$\mathrm{WKL}\left(P_{a} \middle| \overline{P_{a}}\right) = \sum_{j=0}^{k-1} \frac{Q_{a}(j)\,\log\left(\frac{Q_{a}(j)}{Q_{\bar{a}}(j)}\right)}{\sqrt{Q_{\bar{a}}(j)\left[1 - Q_{\bar{a}}(j)\right]}}$
where Q_(a)(j) is the cumulative distribution function of the signal intensities of the probes in P_(a) found in bin b_(j); Q_(ā)(j) is the cumulative distribution function of the signal intensities of the probes in $\overline{P_{a}}$ found in bin b_(j); and where P_(a) is the set of probes of a virus v_(a) and $\overline{P_{a}} = P - P_{a}$.
10. The apparatus according to claim 9, wherein the target nucleic acid to be detected is at least one nucleic acid exogenous to the nucleic acid of the biological sample.
11. The apparatus according to claim 9, wherein the presence of a target nucleic acid in a biological sample is given by a value of Weighted Kullback-Leibler divergence of ≧1.0.
12. The apparatus according to claim 9, wherein each signature probe set (SPS) which represents the absence of target nucleic acid(s) has a Weighted Kullback-Leibler (WKL) divergence score of WKL<5, and wherein each signature probe set (SPS) which represents the presence of at least one target nucleic acid has a Weighted Kullback-Leibler (WKL) divergence score of WKL>5.
13. The apparatus according to claim 9, further comprising performing an Anderson-Darling test on the distribution of WKL score(s), wherein a result of P>0.05 thereby indicates the absence of target nucleic acid(s) and wherein a result of P<0.05 thereby indicates the presence of target nucleic acid(s).
14. A non-transitory electronic storage medium comprising a software with instructions to cause a computing processing unit to perform the method according to claim 1.
15. A non-transitory electronic storage medium comprising a software with instructions to cause a computing processing unit to determine the WKL divergence score according to claim 6 and/or perform the Anderson Darling test according to claim 8.
16. The method according to claim 8, wherein P<0.05 indicates the distribution of WKL scores is not normal and P>0.05 indicates the distribution of WKL scores is normal.
17. The method according to claim 16, wherein if the distribution of WKL scores is not normal, the target nucleic acid molecule with the highest WKL score is identified as present in the biological sample.
18. The method according to claim 17, further comprising removing the highest WKL score from the WKL scores, and repeating the Anderson-Darling test on the remaining WKL scores to determine if the distribution of the remaining WKL scores is normal.
19. The method according to claim 18, wherein if the distribution of the remaining WKL scores is not normal, the target nucleic acid molecule with the next highest WKL score is also identified as present.
20. The method according to claim 19, wherein the target nucleic acid molecule with the next highest WKL score is indicative of a co-infecting pathogen.
21. The method according to claim 19, comprising repeating the steps of removing the next highest WKL score and repeating the Anderson-Darling test until the distribution of the WKL scores becomes normal, thereby detecting the presence of any other target nucleic acid molecules and/or co-infecting pathogens.
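
By way of illustration only, the iterative procedure recited in claims 16 to 21 can be sketched as follows. The sketch does not compute a P value; instead it compares a small-sample-corrected Anderson-Darling statistic against the commonly tabulated 5% critical value (0.752) for normality with estimated mean and variance, which gives the same accept/reject decision as P<0.05. The correction factor, the critical value, the minimum sample size guard, and all names and scores are assumptions of this sketch and are not taken from the claims.

```python
import math
from statistics import NormalDist, mean, stdev

AD_CRITICAL_5PCT = 0.752  # assumed 5% critical value, mean and variance estimated

def ad_statistic(values):
    """Small-sample-corrected Anderson-Darling statistic for normality."""
    x = sorted(values)
    n = len(x)
    m, s = mean(x), stdev(x)
    if s == 0:
        return 0.0  # degenerate sample; treated here as "cannot reject normality"
    eps = 1e-12  # keep the logarithms finite for extreme outliers
    f = [min(max(NormalDist(m, s).cdf(v), eps), 1 - eps) for v in x]
    total = sum((2 * i + 1) * (math.log(f[i]) + math.log(1 - f[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)

def detect_by_iterated_ad(wkl_by_target, min_n=8):
    """Peel off the highest WKL score while the remaining scores look non-normal."""
    remaining = dict(wkl_by_target)
    detected = []
    while (len(remaining) >= min_n
           and ad_statistic(remaining.values()) > AD_CRITICAL_5PCT):
        top = max(remaining, key=remaining.get)
        detected.append(top)
        del remaining[top]
    return detected

# Hypothetical scores: 20 background targets plus one or two clear outliers.
background = {f"v{i}": 1.0 + 0.5 * NormalDist().inv_cdf((i + 0.5) / 20)
              for i in range(20)}
single = dict(background, pathogen=18.5)
double = dict(background, pathogen=18.5, co_infection=15.0)
print(detect_by_iterated_ad(single), detect_by_iterated_ad(double))
```

The loop stops as soon as the remaining scores are consistent with a normal distribution, so a sample without any outlying score yields an empty list.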